2016-10-03 21:20:40

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V3 00/11] block-throttle: add .high limit

Hi,

The background is that we don't have an I/O scheduler for blk-mq yet, so we
can't prioritize processes/cgroups. This patch set adds basic arbitration
between cgroups with blk-throttle. It adds a new limit, io.high, to
blk-throttle. The interface is cgroup2 only.

io.max is a hard limit: cgroups with a max limit never dispatch more IO than
their max limit. io.high, in contrast, is best-effort throttling: cgroups with
a high limit can run above their high limit when it is safe to do so.
Specifically, if all cgroups reach their high limit, all of them may run above
it. If any cgroup runs under its high limit, all other cgroups are throttled
back to their high limit.

An example usage: a high-priority cgroup with a large high limit and a
low-priority cgroup with a small high limit. If the high-priority cgroup isn't
running, the low-priority cgroup can run above its high limit, so bandwidth
isn't wasted. When the high-priority cgroup runs and is below its high limit,
the low-priority cgroup is throttled back to its high limit, which protects the
high-priority cgroup and lets it get more resources. If both cgroups reach
their high limit, both can run above it (eg, fully utilize the disk bandwidth).
None of this can be done with the io.max limit alone.

The implementation is simple. The disk queue has two states, LIMIT_HIGH and
LIMIT_MAX. In each state, cgroups are throttled according to the corresponding
limit: the io.high limit in LIMIT_HIGH state and the io.max limit in LIMIT_MAX
state. The queue is upgraded/downgraded between LIMIT_HIGH and LIMIT_MAX
according to the rule above. The initial state is LIMIT_MAX, and if no cgroup
sets io.high the queue stays in LIMIT_MAX, so users with only io.max set will
see no behavior change from these patches.
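
A minimal sketch of the per-state limit lookup the series implements
(condensed from patches 1-3; names and types are simplified, not the exact
kernel code):

#include <stdint.h>

enum { LIMIT_HIGH, LIMIT_MAX, LIMIT_CNT };  /* queue states / limit slots */

struct tg_limits {
        /* [READ/WRITE][state]; (uint64_t)-1 means "no limit configured" */
        uint64_t bps[2][LIMIT_CNT];
};

/* Effective bps limit of a cgroup for the current queue state. */
static uint64_t tg_bps_limit(const struct tg_limits *tg, int rw, int state)
{
        uint64_t ret = tg->bps[rw][state];

        /*
         * A cgroup without an io.high setting is throttled to its io.max
         * value even while the queue is in LIMIT_HIGH state.
         */
        if (ret == (uint64_t)-1 && state == LIMIT_HIGH)
                return tg->bps[rw][LIMIT_MAX];
        return ret;
}

The iops lookup works the same way; the real code also exempts the cgroup2
root group from throttling.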

The first 8 patches implement the basic framework: they add the interface and
handle the upgrade and downgrade logic. Patch 8 detects the special case where
a cgroup has gone completely idle; in that case we ignore the cgroup's limit.
Patches 9-11 add more heuristics.

The basic framework has 2 major issues.
1. Fluctuation. When the state is upgraded from LIMIT_HIGH to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in unexpected ways. For
example, a cgroup's bandwidth can drop below its io.high limit very soon after
an upgrade. Patch 9 has more details about this issue.
2. Idle cgroup. A cgroup with an io.high limit doesn't always dispatch enough
IO to cross it. Under the upgrade rule above, the disk then stays in LIMIT_HIGH
state and the other cgroups can't dispatch IO above their high limit, which
wastes disk bandwidth. Patch 10 has more details about this issue.

For issue 1, we make cgroup bandwidth increase smoothly after an upgrade. This
reduces the chance that a cgroup's bandwidth drops below its high limit shortly
after the upgrade. The smoothing means we may waste some bandwidth during the
transition, but that's a price we pay for sharing.
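
Roughly, the smoothing in patch 9 ramps a cgroup's effective limit after an
upgrade as

  effective = min(io.max, io.high + (io.high / 2) * n)

where n is the number of throtl_slice intervals elapsed since the upgrade, so
the allowance grows from the high limit toward the max limit instead of jumping
immediately.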

Issue 2 is very hard to solve. To be honest, I don't have a good solution yet.
Patch 10 uses the 'think time check' idea borrowed from CFQ to detect idle
cgroups. It's not perfect, eg, it doesn't work well for high IO depth
workloads, but it's the best approach I have tried so far and it works well in
practice. This definitely needs more tuning.

Please review, test and consider merging.

Thanks,
Shaohua

V2->V3:
- Rebase
- Fix several bugs
- Make the hard disk think time threshold bigger

V1->V2:
- Drop the io.low interface for simplicity; the interface isn't a must-have to
prioritize cgroups.
- Remove the 'trial' logic, which creates too much fluctuation
- Add a new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2


Shaohua Li (11):
block-throttle: prepare support multiple limits
block-throttle: add .high interface
block-throttle: configure bps/iops limit for cgroup in high limit
block-throttle: add upgrade logic for LIMIT_HIGH state
block-throttle: add downgrade logic
blk-throttle: make sure expire time isn't too big
blk-throttle: make throtl_slice tunable
blk-throttle: detect completed idle cgroup
block-throttle: make bandwidth change smooth
block-throttle: add a simple idle detection
blk-throttle: ignore idle cgroup limit

block/bio.c | 2 +
block/blk-sysfs.c | 18 ++
block/blk-throttle.c | 715 +++++++++++++++++++++++++++++++++++++++++-----
block/blk.h | 9 +
include/linux/blk_types.h | 1 +
5 files changed, 680 insertions(+), 65 deletions(-)

--
2.9.3


2016-10-03 21:20:50

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 10/11] block-throttle: add a simple idle detection

A cgroup can be assigned a high limit but never dispatch enough IO to cross
it. In that case, the queue state machine stays in LIMIT_HIGH state and all
other cgroups keep being throttled to their high limit, which is unfair to
them. We should treat such a cgroup as idle and upgrade the state machine to
the higher state.

We also have downgrade logic. If the state machine upgrades because a cgroup
is idle (really idle), it will downgrade again soon because that cgroup stays
below its high limit; this isn't what we want. A more complicated case: the
cgroup isn't idle while the queue is in LIMIT_HIGH, but once the queue is
upgraded to the higher state, other cgroups dispatch more IO and this cgroup
can't dispatch enough, so it drops below its high limit and merely looks idle
(fake idle). In that case the queue should downgrade soon. The key to deciding
whether to downgrade is detecting whether the cgroup is truly idle.

Unfortunately it's very hard to determine whether a cgroup is really idle.
This patch uses the 'think time check' idea from CFQ for that purpose. Note
that the idea doesn't work for all workloads. For example, a workload with IO
depth 8 can keep disk utilization at 100%, so its think time is 0, ie, it
looks busy; yet the same workload could reach higher bandwidth with IO depth
16. Compared to the IO depth 16 case, the IO depth 8 workload is effectively
idle. We use the idea only to roughly determine whether a cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by default
50us for SSD and 1ms for hard disk). The idea is that a think time above the
threshold starts to hurt performance. A hard disk is much slower, so a longer
think time is acceptable there. There is also a sysfs knob that lets the user
configure the threshold.
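
For reference, a condensed sketch of the bookkeeping this patch adds (it
mirrors the diff below with kernel details stripped; names and types are
simplified): the think time is an exponentially weighted average of the gap
between the last IO completion and the next IO submission, and a cgroup counts
as idle when that average exceeds the threshold or it hasn't completed any IO
for several threshold intervals.

#include <stdbool.h>
#include <stdint.h>

struct tg_idle_state {
        uint64_t last_finish_time;  /* ns timestamp of last bio completion */
        uint64_t avg_ttime;         /* EWMA of submit - last completion gap */
};

static void update_ttime(struct tg_idle_state *s, uint64_t now)
{
        if (now <= s->last_finish_time || !s->last_finish_time)
                return;
        /* weight: 31/32 old average, 1/32 new sample */
        s->avg_ttime = (s->avg_ttime * 31 + (now - s->last_finish_time)) >> 5;
}

static bool tg_is_idle(const struct tg_idle_state *s, uint64_t now,
                       uint64_t threshold)
{
        /* idle: long average think time, or no completion for 4 thresholds */
        return s->avg_ttime > threshold ||
               now - s->last_finish_time > 4 * threshold;
}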

Signed-off-by: Shaohua Li <[email protected]>
---
block/bio.c | 2 +
block/blk-sysfs.c | 7 ++++
block/blk-throttle.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++-
block/blk.h | 6 +++
include/linux/blk_types.h | 1 +
5 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index aa73540..06e414c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
#include <linux/cgroup.h>

#include <trace/events/block.h>
+#include "blk.h"

/*
* Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1758,6 +1759,7 @@ void bio_endio(struct bio *bio)
goto again;
}

+ blk_throtl_bio_endio(bio);
if (bio->bi_end_io)
bio->bi_end_io(bio);
}
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 610f08d..209b67c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -532,6 +532,12 @@ static struct queue_sysfs_entry throtl_slice_entry = {
.show = blk_throtl_slice_show,
.store = blk_throtl_slice_store,
};
+
+static struct queue_sysfs_entry throtl_idle_threshold_entry = {
+ .attr = {.name = "throttling_idle_threshold", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_throtl_idle_threshold_show,
+ .store = blk_throtl_idle_threshold_store,
+};
#endif

static struct attribute *default_attrs[] = {
@@ -563,6 +569,7 @@ static struct attribute *default_attrs[] = {
&queue_dax_entry.attr,
#ifdef CONFIG_BLK_DEV_THROTTLING
&throtl_slice_entry.attr,
+ &throtl_idle_threshold_entry.attr,
#endif
NULL,
};
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1658b13..e8a2f31 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,8 @@ static int throtl_quantum = 32;
/* Throttling is performed over 100ms slice and after that slice is renewed */
#define DFL_THROTL_SLICE (HZ / 10)
#define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */

static struct blkcg_policy blkcg_policy_throtl;

@@ -149,6 +151,10 @@ struct throtl_grp {
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
+
+ u64 last_finish_time;
+ u64 checked_last_finish_time;
+ u64 avg_ttime;
};

struct throtl_data
@@ -172,6 +178,8 @@ struct throtl_data
unsigned long high_downgrade_time;

unsigned int scale;
+
+ u64 idle_ttime_threshold;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -1626,6 +1634,14 @@ static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
return ret;
}

+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+ /* cgroup is idle if average think time is more than threshold */
+ return ktime_get_ns() - tg->last_finish_time >
+ 4 * tg->td->idle_ttime_threshold ||
+ tg->avg_ttime > tg->td->idle_ttime_threshold;
+}
+
static bool throtl_upgrade_check_one(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1830,6 +1846,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
tg->last_io_disp[WRITE] = 0;
}

+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+ u64 now = ktime_get_ns();
+ u64 last_finish_time = tg->last_finish_time;
+
+ if (now <= last_finish_time || last_finish_time == 0 ||
+ last_finish_time == tg->checked_last_finish_time)
+ return;
+
+ tg->avg_ttime = (tg->avg_ttime * 31 + now - last_finish_time) >> 5;
+ tg->checked_last_finish_time = last_finish_time;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1841,6 +1870,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

WARN_ON_ONCE(!rcu_read_lock_held());

+ if (tg->td->idle_ttime_threshold == -1) {
+ if (blk_queue_nonrot(q))
+ tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
+ else
+ tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+ }
+
/* see throtl_charge_bio() */
if ((bio->bi_opf & REQ_THROTTLED) || !tg->has_rules[rw])
goto out;
@@ -1850,6 +1886,11 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;

+ bio_associate_current(bio);
+ bio->bi_cg_private = q;
+
+ blk_throtl_update_ttime(tg);
+
sq = &tg->service_queue;

again:
@@ -1909,7 +1950,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

tg->last_high_overflow_time[rw] = jiffies;

- bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
throttled = true;
@@ -1938,6 +1978,34 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
return throttled;
}

+void blk_throtl_bio_endio(struct bio *bio)
+{
+ struct blkcg *blkcg;
+ struct blkcg_gq *blkg;
+ struct throtl_grp *tg;
+ struct request_queue *q;
+
+ q = bio->bi_cg_private;
+ if (!q)
+ return;
+ bio->bi_cg_private = NULL;
+
+ rcu_read_lock();
+ blkcg = bio_blkcg(bio);
+ if (!blkcg)
+ goto end;
+ blkg = blkg_lookup(blkcg, q);
+ if (!blkg)
+ goto end;
+
+ tg = blkg_to_tg(blkg ?: q->root_blkg);
+
+ tg->last_finish_time = ktime_get_ns();
+
+end:
+ rcu_read_unlock();
+}
+
/*
* Dispatch all bios from all children tg's queued on @parent_sq. On
* return, @parent_sq is guaranteed to not have any active children tg's
@@ -2023,6 +2091,8 @@ int blk_throtl_init(struct request_queue *q)
td->limit_index = LIMIT_MAX;
td->high_upgrade_time = jiffies;
td->high_downgrade_time = jiffies;
+
+ td->idle_ttime_threshold = -1;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
@@ -2062,6 +2132,30 @@ ssize_t blk_throtl_slice_store(struct request_queue *q,
return count;
}

+ssize_t blk_throtl_idle_threshold_show(struct request_queue *q, char *page)
+{
+ u64 threshold;
+
+ if (!q->td)
+ return -EINVAL;
+ threshold = q->td->idle_ttime_threshold;
+ do_div(threshold, 1000);
+ return sprintf(page, "%lluus\n", threshold);
+}
+
+ssize_t blk_throtl_idle_threshold_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long v;
+
+ if (!q->td)
+ return -EINVAL;
+ if (kstrtoul(page, 10, &v))
+ return -EINVAL;
+ if (v == 0)
+ return -EINVAL;
+ q->td->idle_ttime_threshold = v * 1000;
+ return count;
+}
+
static int __init throtl_init(void)
{
kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index 8ad6068..8e1aeca 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -297,10 +297,16 @@ extern void blk_throtl_exit(struct request_queue *q);
extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
extern ssize_t blk_throtl_slice_store(struct request_queue *q,
const char *page, size_t count);
+extern ssize_t blk_throtl_idle_threshold_show(struct request_queue *q,
+ char *page);
+extern ssize_t blk_throtl_idle_threshold_store(struct request_queue *q,
+ const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
#endif /* CONFIG_BLK_DEV_THROTTLING */

#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 436f43f..be9d10d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -60,6 +60,7 @@ struct bio {
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
+ void *bi_cg_private;
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
--
2.9.3

2016-10-03 21:20:43

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 07/11] blk-throttle: make throtl_slice tunable

throtl_slice is important for blk-throttling: a lot of logic depends on it,
for example throughput measurement. Its default value of 100ms is not
appropriate for all disks; for an SSD we might want a smaller value to make
the throughput smoother. This patch makes it tunable through sysfs.
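
For concreteness (assuming HZ=1000, which is a config assumption, so 1 jiffy =
1ms): DFL_THROTL_SLICE = HZ/10 = 100ms and MAX_THROTL_SLICE = HZ/5 = 200ms, so
the new throttling_slice knob accepts values from 1ms up to 200ms and rejects
anything else with -EINVAL; the current value is reported in milliseconds.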

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-sysfs.c | 11 ++++++++
block/blk-throttle.c | 72 ++++++++++++++++++++++++++++++++++++----------------
block/blk.h | 3 +++
3 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f87a7e7..610f08d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -526,6 +526,14 @@ static struct queue_sysfs_entry queue_dax_entry = {
.show = queue_dax_show,
};

+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_slice_entry = {
+ .attr = {.name = "throttling_slice", .mode = S_IRUGO | S_IWUSR },
+ .show = blk_throtl_slice_show,
+ .store = blk_throtl_slice_store,
+};
+#endif
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -553,6 +561,9 @@ static struct attribute *default_attrs[] = {
&queue_poll_entry.attr,
&queue_wc_entry.attr,
&queue_dax_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+ &throtl_slice_entry.attr,
+#endif
NULL,
};

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 778de0b..4263f0c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
static int throtl_quantum = 32;

/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10; /* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ / 5)

static struct blkcg_policy blkcg_policy_throtl;

@@ -158,6 +159,8 @@ struct throtl_data
/* Total Number of queued bios on READ and WRITE lists */
unsigned int nr_queued[2];

+ unsigned int throtl_slice;
+
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
unsigned int limit_index;
@@ -589,7 +592,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
- unsigned long max_expire = jiffies + 8 * throtl_slice;
+ unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;
if (time_after(expires, max_expire))
expires = max_expire;
mod_timer(&sq->pending_timer, expires);
@@ -649,7 +652,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
if (time_after_eq(start, tg->slice_start[rw]))
tg->slice_start[rw] = start;

- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -661,7 +664,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
tg->bytes_disp[rw] = 0;
tg->io_disp[rw] = 0;
tg->slice_start[rw] = jiffies;
- tg->slice_end[rw] = jiffies + throtl_slice;
+ tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
"[%c] new slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -671,13 +674,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
}

static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+ tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
throtl_log(&tg->service_queue,
"[%c] extend slice start=%lu end=%lu jiffies=%lu",
rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -717,19 +720,19 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
* is bad because it does not allow new slice to start.
*/

- throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+ throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);

time_elapsed = jiffies - tg->slice_start[rw];

- nr_slices = time_elapsed / throtl_slice;
+ nr_slices = time_elapsed / tg->td->throtl_slice;

if (!nr_slices)
return;
- tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices)/HZ;
+ io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices)/HZ;

if (!bytes_trim && !io_trim)
return;
@@ -744,7 +747,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
else
tg->io_disp[rw] = 0;

- tg->slice_start[rw] += nr_slices * throtl_slice;
+ tg->slice_start[rw] += nr_slices * tg->td->throtl_slice;

throtl_log(&tg->service_queue,
"[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
@@ -764,9 +767,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

/*
* jiffy_elapsed_rnd should not be a big value as minimum iops can be
@@ -813,9 +816,9 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Slice has just started. Consider one slice interval */
if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
+ jiffy_elapsed_rnd = tg->td->throtl_slice;

- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+ jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
@@ -880,8 +883,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
throtl_start_new_slice(tg, rw);
else {
- if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
- throtl_extend_slice(tg, rw, jiffies + throtl_slice);
+ if (time_before(tg->slice_end[rw], jiffies + tg->td->throtl_slice))
+ throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
}

if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
@@ -1630,7 +1633,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_HIGH)
return false;

- if (time_before(jiffies, td->high_downgrade_time + throtl_slice))
+ if (time_before(jiffies, td->high_downgrade_time + td->throtl_slice))
return false;

rcu_read_lock();
@@ -1687,8 +1690,8 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
* If cgroup is below high limit, consider downgrade and throttle other
* cgroups
*/
- if (time_after_eq(now, td->high_upgrade_time + throtl_slice) &&
- time_after_eq(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ if (time_after_eq(now, td->high_upgrade_time + td->throtl_slice) &&
+ time_after_eq(now, tg_last_high_overflow_time(tg) + td->throtl_slice))
return true;
return false;
}
@@ -1721,10 +1724,10 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
return;
if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
return;
- if (time_after(tg->last_check_time + throtl_slice, now))
+ if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
return;

- if (time_before(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ if (time_before(now, tg_last_high_overflow_time(tg) + tg->td->throtl_slice))
return;

elapsed_time = now - tg->last_check_time;
@@ -1962,6 +1965,7 @@ int blk_throtl_init(struct request_queue *q)

q->td = td;
td->queue = q;
+ td->throtl_slice = DFL_THROTL_SLICE;

td->limit_valid[LIMIT_HIGH] = false;
td->limit_valid[LIMIT_MAX] = true;
@@ -1983,6 +1987,30 @@ void blk_throtl_exit(struct request_queue *q)
kfree(q->td);
}

+ssize_t blk_throtl_slice_show(struct request_queue *q, char *page)
+{
+ if (!q->td)
+ return -EINVAL;
+ return sprintf(page, "%ums\n", jiffies_to_msecs(q->td->throtl_slice));
+}
+
+ssize_t blk_throtl_slice_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long v;
+ unsigned long t;
+
+ if (!q->td)
+ return -EINVAL;
+ if (kstrtoul(page, 10, &v))
+ return -EINVAL;
+ t = msecs_to_jiffies(v);
+ if (t == 0 || t > MAX_THROTL_SLICE)
+ return -EINVAL;
+ q->td->throtl_slice = t;
+ return count;
+}
+
static int __init throtl_init(void)
{
kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index c37492f..8ad6068 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -294,6 +294,9 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
extern void blk_throtl_drain(struct request_queue *q);
extern int blk_throtl_init(struct request_queue *q);
extern void blk_throtl_exit(struct request_queue *q);
+extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
+extern ssize_t blk_throtl_slice_store(struct request_queue *q,
+ const char *page, size_t count);
#else /* CONFIG_BLK_DEV_THROTTLING */
static inline void blk_throtl_drain(struct request_queue *q) { }
static inline int blk_throtl_init(struct request_queue *q) { return 0; }
--
2.9.3

2016-10-03 21:21:28

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 04/11] block-throttle: add upgrade logic for LIMIT_HIGH state

When the queue is in LIMIT_HIGH state and all cgroups with a high limit cross
their bps/iops limits, we upgrade the queue's state to LIMIT_MAX.

For a cgroup hierarchy, there are two cases. If the children have a lower high
limit than the parent, the parent's high limit is meaningless: once the
children's bps/iops cross their high limit, we can upgrade the queue state. In
the other case the children have a higher high limit than the parent, so the
children's high limit is meaningless: as long as the parent's bps/iops cross
its high limit, we can upgrade the queue state.
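
A concrete example (illustrative numbers, not from the patch): if the parent's
high limit is 100MB/s and the child's is 10MB/s, the child saturates first, so
the child crossing 10MB/s is what counts. If instead the parent's high limit
is 10MB/s and the child's is 100MB/s, the child can never exceed 10MB/s
through the parent, so it is the parent crossing its limit that counts. The
hierarchy walk in throtl_upgrade_check_hierarchy() below covers both cases by
treating a subtree as having reached its limit when the cgroup or any of its
ancestors has bios queued behind a configured limit.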

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 100 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e2b3704..db2ea15 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -456,6 +456,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
td->limit_valid[LIMIT_HIGH] = false;
}

+static void throtl_upgrade_state(struct throtl_data *td);
static void throtl_pd_offline(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -467,9 +468,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)

blk_throtl_update_valid_limit(tg->td);

- if (tg->td->limit_index == LIMIT_HIGH &&
- !tg->td->limit_valid[LIMIT_HIGH])
- tg->td->limit_index = LIMIT_MAX;
+ if (!tg->td->limit_valid[tg->td->limit_index])
+ throtl_upgrade_state(tg->td);
}

static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1077,6 +1077,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
return nr_disp;
}

+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg);
/**
* throtl_pending_timer_fn - timer function for service_queue->pending_timer
* @arg: the throtl_service_queue being serviced
@@ -1103,6 +1105,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
int ret;

spin_lock_irq(q->queue_lock);
+ if (throtl_can_upgrade(td, NULL))
+ throtl_upgrade_state(td);
+
again:
parent_sq = sq->parent_sq;
dispatched = false;
@@ -1505,6 +1510,91 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static bool throtl_upgrade_check_one(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *sq = &tg->service_queue;
+ bool read_limit, write_limit;
+
+ /* if cgroup reaches high/max limit, it's ok to next limit */
+ read_limit = tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[READ][LIMIT_MAX] != -1 ||
+ tg->iops[READ][LIMIT_MAX] != -1;
+ write_limit = tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_MAX] != -1 ||
+ tg->iops[WRITE][LIMIT_MAX] != -1;
+ if (read_limit && sq->nr_queued[READ] &&
+ (!write_limit || sq->nr_queued[WRITE]))
+ return true;
+ if (write_limit && sq->nr_queued[WRITE] &&
+ (!read_limit || sq->nr_queued[READ]))
+ return true;
+ return false;
+}
+
+static bool throtl_upgrade_check_hierarchy(struct throtl_grp *tg)
+{
+ if (throtl_upgrade_check_one(tg))
+ return true;
+ while (true) {
+ if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+ !tg_to_blkg(tg)->parent))
+ return false;
+ if (throtl_upgrade_check_one(tg))
+ return true;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ }
+ return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+ struct throtl_grp *this_tg)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ if (td->limit_index != LIMIT_HIGH)
+ return false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg == this_tg)
+ continue;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ continue;
+ if (!throtl_upgrade_check_hierarchy(tg)) {
+ rcu_read_unlock();
+ return false;
+ }
+ }
+ rcu_read_unlock();
+ return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ td->limit_index = LIMIT_MAX;
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+ struct throtl_service_queue *sq = &tg->service_queue;
+
+ tg->disptime = jiffies - 1;
+ throtl_select_dispatch(sq);
+ throtl_schedule_next_dispatch(sq, false);
+ }
+ rcu_read_unlock();
+ throtl_select_dispatch(&td->service_queue);
+ throtl_schedule_next_dispatch(&td->service_queue, false);
+ queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1527,14 +1617,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

sq = &tg->service_queue;

+again:
while (true) {
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
- if (!tg_may_dispatch(tg, bio, NULL))
+ if (!tg_may_dispatch(tg, bio, NULL)) {
+ if (throtl_can_upgrade(tg->td, tg)) {
+ throtl_upgrade_state(tg->td);
+ goto again;
+ }
break;
+ }

/* within limits, let's charge and dispatch directly */
throtl_charge_bio(tg, bio);
--
2.9.3

2016-10-03 21:21:39

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 11/11] blk-throttle: ignore idle cgroup limit

The last patch introduced a way to detect an idle cgroup. Use it to make the
upgrade/downgrade decisions: an idle cgroup's high limit no longer holds the
queue in LIMIT_HIGH state, and an idle cgroup doesn't by itself trigger a
downgrade.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e8a2f31..c46d6e4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -146,8 +146,6 @@ struct throtl_grp {

unsigned long last_check_time;

- unsigned long last_dispatch_time[2];
-
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -489,8 +487,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
* Update has_rules[] after a new group is brought online.
*/
tg_update_has_rules(tg);
- tg->last_dispatch_time[READ] = jiffies;
- tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1664,9 +1660,8 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg)
return true;

if (time_after_eq(jiffies,
- tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
- time_after_eq(jiffies,
- tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ tg_last_high_overflow_time(tg) + tg->td->throtl_slice) &&
+ throtl_tg_is_idle(tg))
return true;
return false;
}
@@ -1715,6 +1710,19 @@ static bool throtl_can_upgrade(struct throtl_data *td,
return true;
}

+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+ if (tg->td->limit_index != LIMIT_HIGH)
+ return;
+
+ if (!time_after_eq(jiffies,
+ __tg_last_high_overflow_time(tg) + tg->td->throtl_slice))
+ return;
+
+ if (throtl_can_upgrade(tg->td, NULL))
+ throtl_upgrade_state(tg->td);
+}
+
static void throtl_upgrade_state(struct throtl_data *td)
{
struct cgroup_subsys_state *pos_css;
@@ -1749,15 +1757,14 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

- if (time_after_eq(now, tg->last_dispatch_time[READ] + td->throtl_slice) &&
- time_after_eq(now, tg->last_dispatch_time[WRITE] + td->throtl_slice))
- return false;
/*
* If cgroup is below high limit, consider downgrade and throttle other
* cgroups
*/
if (time_after_eq(now, td->high_upgrade_time + td->throtl_slice) &&
- time_after_eq(now, tg_last_high_overflow_time(tg) + td->throtl_slice))
+ time_after_eq(now, tg_last_high_overflow_time(tg) + td->throtl_slice) &&
+ (!throtl_tg_is_idle(tg) ||
+ !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
return true;
return false;
}
@@ -1895,10 +1902,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
- tg->last_dispatch_time[rw] = jiffies;
if (tg->last_high_overflow_time[rw] == 0)
tg->last_high_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
+ throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
--
2.9.3

2016-10-03 21:21:52

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 05/11] block-throttle: add downgrade logic

When the queue state machine is in LIMIT_MAX state but a cgroup stays below
its high limit for some time, the queue should be downgraded to the lower
state, since that cgroup's high limit isn't being met.
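
Condensed, the downgrade criterion added below is: downgrade from LIMIT_MAX
back to LIMIT_HIGH once (a) at least one throtl_slice has passed since the
last upgrade, and (b) at least one throtl_slice has passed since the cgroup
(and every relevant ancestor) last crossed its high limit;
throtl_downgrade_check() samples the per-slice dispatched bytes/IOs to decide
whether the limit was crossed.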

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 187 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 187 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index db2ea15..41b75a3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -136,6 +136,13 @@ struct throtl_grp {
/* Number of bio's dispatched in current slice */
unsigned int io_disp[2];

+ unsigned long last_high_overflow_time[2];
+
+ uint64_t last_bytes_disp[2];
+ unsigned int last_io_disp[2];
+
+ unsigned long last_check_time;
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -155,6 +162,9 @@ struct throtl_data
struct work_struct dispatch_work;
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
+
+ unsigned long high_upgrade_time;
+ unsigned long high_downgrade_time;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -896,6 +906,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
/* Charge the bio to the group */
tg->bytes_disp[rw] += bio->bi_iter.bi_size;
tg->io_disp[rw]++;
+ tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+ tg->last_io_disp[rw]++;

/*
* REQ_THROTTLED is used to prevent the same bio to be throttled
@@ -1510,6 +1522,64 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};

+static unsigned long __tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+ unsigned long rtime = -1, wtime = -1;
+ if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[READ][LIMIT_MAX] != -1 ||
+ tg->iops[READ][LIMIT_MAX] != -1)
+ rtime = tg->last_high_overflow_time[READ];
+ if (tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_MAX] != -1 ||
+ tg->iops[WRITE][LIMIT_MAX] != -1)
+ wtime = tg->last_high_overflow_time[WRITE];
+ return min(rtime, wtime) == -1 ? 0 : min(rtime, wtime);
+}
+
+static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *parent_sq;
+ struct throtl_grp *parent = tg;
+ unsigned long ret = __tg_last_high_overflow_time(tg);
+
+ while (true) {
+ parent_sq = parent->service_queue.parent_sq;
+ parent = sq_to_tg(parent_sq);
+ if (!parent)
+ break;
+ if (((parent->bps[READ][LIMIT_HIGH] != -1 &&
+ parent->bps[READ][LIMIT_HIGH] >=
+ tg->bps[READ][LIMIT_HIGH]) ||
+ (parent->bps[READ][LIMIT_HIGH] == -1 &&
+ parent->bps[READ][LIMIT_MAX] >=
+ tg->bps[READ][LIMIT_HIGH])) &&
+ ((parent->bps[WRITE][LIMIT_HIGH] != -1 &&
+ parent->bps[WRITE][LIMIT_HIGH] >=
+ tg->bps[WRITE][LIMIT_HIGH]) ||
+ (parent->bps[WRITE][LIMIT_HIGH] == -1 &&
+ parent->bps[WRITE][LIMIT_MAX] >=
+ tg->bps[WRITE][LIMIT_HIGH])) &&
+ ((parent->iops[READ][LIMIT_HIGH] != -1 &&
+ parent->iops[READ][LIMIT_HIGH] >=
+ tg->iops[READ][LIMIT_HIGH]) ||
+ (parent->iops[READ][LIMIT_HIGH] == -1 &&
+ parent->iops[READ][LIMIT_MAX] >=
+ tg->iops[READ][LIMIT_HIGH])) &&
+ ((parent->iops[WRITE][LIMIT_HIGH] != -1 &&
+ parent->iops[WRITE][LIMIT_HIGH] >=
+ tg->iops[WRITE][LIMIT_HIGH]) ||
+ (parent->iops[WRITE][LIMIT_HIGH] == -1 &&
+ parent->iops[WRITE][LIMIT_MAX] >=
+ tg->iops[WRITE][LIMIT_HIGH])))
+ break;
+ if (time_after(__tg_last_high_overflow_time(parent), ret))
+ ret = __tg_last_high_overflow_time(parent);
+ }
+ return ret;
+}
+
static bool throtl_upgrade_check_one(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1557,6 +1627,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_HIGH)
return false;

+ if (time_before(jiffies, td->high_downgrade_time + throtl_slice))
+ return false;
+
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1580,6 +1653,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
struct blkcg_gq *blkg;

td->limit_index = LIMIT_MAX;
+ td->high_upgrade_time = jiffies;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1595,6 +1669,111 @@ static void throtl_upgrade_state(struct throtl_data *td)
queue_work(kthrotld_workqueue, &td->dispatch_work);
}

+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+ td->limit_index = new;
+ td->high_downgrade_time = jiffies;
+}
+
+static bool throtl_downgrade_check_one(struct throtl_grp *tg)
+{
+ struct throtl_data *td = tg->td;
+ unsigned long now = jiffies;
+
+ /*
+ * If cgroup is below high limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (time_after_eq(now, td->high_upgrade_time + throtl_slice) &&
+ time_after_eq(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ return true;
+ return false;
+}
+
+static bool throtl_downgrade_check_hierarchy(struct throtl_grp *tg)
+{
+ if (!throtl_downgrade_check_one(tg))
+ return false;
+ while (true) {
+ if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+ !tg_to_blkg(tg)->parent))
+ break;
+
+ if (!throtl_downgrade_check_one(tg))
+ return false;
+ tg = sq_to_tg(tg->service_queue.parent_sq);
+ }
+ return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+ uint64_t bps;
+ unsigned int iops;
+ unsigned long elapsed_time;
+ unsigned long now = jiffies;
+
+ if (tg->td->limit_index != LIMIT_MAX ||
+ !tg->td->limit_valid[LIMIT_HIGH])
+ return;
+ if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+ return;
+ if (time_after(tg->last_check_time + throtl_slice, now))
+ return;
+
+ if (time_before(now, tg_last_high_overflow_time(tg) + throtl_slice))
+ return;
+
+ elapsed_time = now - tg->last_check_time;
+ tg->last_check_time = now;
+
+ if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[READ][LIMIT_MAX] != -1) {
+ bps = tg->last_bytes_disp[READ] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[READ][LIMIT_HIGH] ||
+ bps >= tg->bps[READ][LIMIT_MAX])
+ tg->last_high_overflow_time[READ] = now;
+ }
+
+ if (tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_MAX] != -1) {
+ bps = tg->last_bytes_disp[WRITE] * HZ;
+ do_div(bps, elapsed_time);
+ if (bps >= tg->bps[WRITE][LIMIT_HIGH] ||
+ bps >= tg->bps[WRITE][LIMIT_MAX])
+ tg->last_high_overflow_time[WRITE] = now;
+ }
+
+ if (tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_MAX] != -1) {
+ iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+ if (iops >= tg->iops[READ][LIMIT_HIGH] ||
+ iops >= tg->iops[READ][LIMIT_MAX])
+ tg->last_high_overflow_time[READ] = now;
+ }
+
+ if (tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_MAX] != -1) {
+ iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+ if (iops >= tg->iops[WRITE][LIMIT_HIGH] ||
+ iops >= tg->iops[WRITE][LIMIT_MAX])
+ tg->last_high_overflow_time[WRITE] = now;
+ }
+
+ /*
+ * If cgroup is below high limit, consider downgrade and throttle other
+ * cgroups
+ */
+ if (throtl_downgrade_check_hierarchy(tg))
+ throtl_downgrade_state(tg->td, LIMIT_HIGH);
+
+ tg->last_bytes_disp[READ] = 0;
+ tg->last_bytes_disp[WRITE] = 0;
+ tg->last_io_disp[READ] = 0;
+ tg->last_io_disp[WRITE] = 0;
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1619,12 +1798,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ if (tg->last_high_overflow_time[rw] == 0)
+ tg->last_high_overflow_time[rw] = jiffies;
+ throtl_downgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;

/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
+ tg->last_high_overflow_time[rw] = jiffies;
if (throtl_can_upgrade(tg->td, tg)) {
throtl_upgrade_state(tg->td);
goto again;
@@ -1667,6 +1850,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

+ tg->last_high_overflow_time[rw] = jiffies;
+
bio_associate_current(bio);
tg->td->nr_queued[rw]++;
throtl_add_bio_tg(bio, qn, tg);
@@ -1778,6 +1963,8 @@ int blk_throtl_init(struct request_queue *q)
td->limit_valid[LIMIT_HIGH] = false;
td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
+ td->high_upgrade_time = jiffies;
+ td->high_downgrade_time = jiffies;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-10-03 21:22:01

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 01/11] block-throttle: prepare support multiple limits

We are going to support both a high limit and a max limit, so each cgroup will
have 2 limits. This patch prepares for the multiple-limits change by turning
the per-cgroup bps/iops fields into per-limit arrays.

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 109 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 68 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a3ea826..964b713 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {

#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

+enum {
+ LIMIT_MAX = 0,
+ LIMIT_CNT = 1,
+};
+
struct throtl_grp {
/* must be the first member */
struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
bool has_rules[2];

/* bytes per second rate limits */
- uint64_t bps[2];
+ uint64_t bps[2][LIMIT_CNT];

/* IOPS limits */
- unsigned int iops[2];
+ unsigned int iops[2][LIMIT_CNT];

/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data

/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
+ unsigned int limit_index;
+ bool limit_valid[LIMIT_CNT];
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
}

+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+ return tg->iops[rw][tg->td->limit_index];
+}
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -320,7 +337,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
{
struct throtl_grp *tg;
- int rw;
+ int rw, index;

tg = kzalloc_node(sizeof(*tg), gfp, node);
if (!tg)
@@ -334,10 +351,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}

RB_CLEAR_NODE(&tg->rb_node);
- tg->bps[READ] = -1;
- tg->bps[WRITE] = -1;
- tg->iops[READ] = -1;
- tg->iops[WRITE] = -1;
+ for (rw = READ; rw <= WRITE; rw++) {
+ for (index = LIMIT_MAX; index < LIMIT_CNT; index++) {
+ tg->bps[rw][index] = -1;
+ tg->iops[rw][index] = -1;
+ }
+ }

return &tg->pd;
}
@@ -376,11 +395,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
static void tg_update_has_rules(struct throtl_grp *tg)
{
struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+ struct throtl_data *td = tg->td;
int rw;

for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
- (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+ (td->limit_valid[td->limit_index] &&
+ (tg_bps_limit(tg, rw) != -1 ||
+ tg_iops_limit(tg, rw) != -1));
}

static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +654,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)

if (!nr_slices)
return;
- tmp = tg->bps[rw] * throtl_slice * nr_slices;
+ tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;

- io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+ io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices)/HZ;

if (!bytes_trim && !io_trim)
return;
@@ -682,7 +704,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
* have been trimmed.
*/

- tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+ tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);

if (tmp > UINT_MAX)
@@ -697,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
}

/* Calc approx time to dispatch */
- jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+ jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg_iops_limit(tg, rw) + 1;

if (jiffy_wait > jiffy_elapsed)
jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +746,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);

- tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+ tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
bytes_allowed = tmp;

@@ -736,7 +758,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
- jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+ jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));

if (!jiffy_wait)
jiffy_wait = 1;
@@ -771,7 +793,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

/* If tg->bps = -1, then BW is unlimited */
- if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+ if (tg_bps_limit(tg, rw) == -1 && tg_iops_limit(tg, rw) == -1) {
if (wait)
*wait = 0;
return true;
@@ -1148,8 +1170,8 @@ static void tg_conf_updated(struct throtl_grp *tg)

throtl_log(&tg->service_queue,
"limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
- tg->bps[READ], tg->bps[WRITE],
- tg->iops[READ], tg->iops[WRITE]);
+ tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+ tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));

/*
* Update has_rules[] flags for the updated tg's subtree. A tg is
@@ -1226,25 +1248,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
static struct cftype throtl_legacy_files[] = {
{
.name = "throttle.read_bps_device",
- .private = offsetof(struct throtl_grp, bps[READ]),
+ .private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.write_bps_device",
- .private = offsetof(struct throtl_grp, bps[WRITE]),
+ .private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_u64,
.write = tg_set_conf_u64,
},
{
.name = "throttle.read_iops_device",
- .private = offsetof(struct throtl_grp, iops[READ]),
+ .private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
{
.name = "throttle.write_iops_device",
- .private = offsetof(struct throtl_grp, iops[WRITE]),
+ .private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
.seq_show = tg_print_conf_uint,
.write = tg_set_conf_uint,
},
@@ -1270,18 +1292,22 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
- tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+ if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
+ tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
return 0;

- if (tg->bps[READ] != -1)
- snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
- if (tg->bps[WRITE] != -1)
- snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
- if (tg->iops[READ] != -1)
- snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
- if (tg->iops[WRITE] != -1)
- snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+ if (tg->bps[READ][LIMIT_MAX] != -1)
+ snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+ tg->bps[READ][LIMIT_MAX]);
+ if (tg->bps[WRITE][LIMIT_MAX] != -1)
+ snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+ tg->bps[WRITE][LIMIT_MAX]);
+ if (tg->iops[READ][LIMIT_MAX] != -1)
+ snprintf(bufs[2], sizeof(bufs[2]), "%u",
+ tg->iops[READ][LIMIT_MAX]);
+ if (tg->iops[WRITE][LIMIT_MAX] != -1)
+ snprintf(bufs[3], sizeof(bufs[3]), "%u",
+ tg->iops[WRITE][LIMIT_MAX]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1310,10 +1336,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ];
- v[1] = tg->bps[WRITE];
- v[2] = tg->iops[READ];
- v[3] = tg->iops[WRITE];
+ v[0] = tg->bps[READ][LIMIT_MAX];
+ v[1] = tg->bps[WRITE][LIMIT_MAX];
+ v[2] = tg->iops[READ][LIMIT_MAX];
+ v[3] = tg->iops[WRITE][LIMIT_MAX];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1350,10 +1376,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ] = v[0];
- tg->bps[WRITE] = v[1];
- tg->iops[READ] = v[2];
- tg->iops[WRITE] = v[3];
+ tg->bps[READ][LIMIT_MAX] = v[0];
+ tg->bps[WRITE][LIMIT_MAX] = v[1];
+ tg->iops[READ][LIMIT_MAX] = v[2];
+ tg->iops[WRITE][LIMIT_MAX] = v[3];

tg_conf_updated(tg);
ret = 0;
@@ -1451,8 +1477,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
/* out-of-limit, queue to @tg */
throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
rw == READ ? 'R' : 'W',
- tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
- tg->io_disp[rw], tg->iops[rw],
+ tg->bytes_disp[rw], bio->bi_iter.bi_size, tg_bps_limit(tg, rw),
+ tg->io_disp[rw], tg_iops_limit(tg, rw),
sq->nr_queued[READ], sq->nr_queued[WRITE]);

bio_associate_current(bio);
@@ -1563,6 +1589,7 @@ int blk_throtl_init(struct request_queue *q)
q->td = td;
td->queue = q;

+ td->limit_valid[LIMIT_MAX] = true;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-10-03 21:22:08

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 03/11] block-throttle: configure bps/iops limit for cgroup in high limit

Each queue has a state machine. Initially the queue is in LIMIT_HIGH state,
which means all cgroups are throttled according to their high limit. After all
cgroups with a high limit cross that limit, the queue state is upgraded to
LIMIT_MAX.
Cgroups without a high limit use their max limit as the high limit.
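
Concretely (illustrative limits): while the queue is in LIMIT_HIGH state, a
cgroup with io.high rbps=10M is throttled to 10MB/s, while a cgroup that only
sets io.max rbps=50M is throttled to 50MB/s, because tg_bps_limit() and
tg_iops_limit() below fall back to the LIMIT_MAX value when no LIMIT_HIGH
value is configured. The cgroup2 root group is never throttled.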

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 59d4b4c..e2b3704 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -208,12 +208,29 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)

static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
- return tg->bps[rw][tg->td->limit_index];
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ uint64_t ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return -1;
+ ret = tg->bps[rw][tg->td->limit_index];
+ if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ return tg->bps[rw][LIMIT_MAX];
+
+ return ret;
}

static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
- return tg->iops[rw][tg->td->limit_index];
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
+ unsigned int ret;
+
+ if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+ return -1;
+ ret = tg->iops[rw][tg->td->limit_index];
+ if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ return tg->iops[rw][LIMIT_MAX];
+ return ret;
}

/**
--
2.9.3

2016-10-03 21:22:13

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 06/11] blk-throttle: make sure expire time isn't too big

A cgroup can be throttled to one limit, but when all cgroups cross their high
limit the queue enters the higher state and the cgroup should then be
throttled to the higher limit. It's possible that the cgroup is sleeping
because of throttling while the other cgroups no longer dispatch IO; in that
case nothing triggers the current upgrade/downgrade logic. To fix this we
could either set up a timer to wake the cgroup when the other cgroups go idle,
or make sure the cgroup never sleeps too long. Setting up such a timer would
mean modifying the timer very frequently, so this patch chooses the latter.
Capping the sleep time doesn't change the cgroup's bps/iops; it only makes the
cgroup wake up more often, which isn't a big deal because throtl_slice * 8 is
already quite long.
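
For example, with the default 100ms throtl_slice the clamp below caps any
pending-timer sleep at 8 * throtl_slice = 800ms, so a throttled cgroup
re-evaluates its state at least that often even if its computed dispatch time
is much further out.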

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 41b75a3..778de0b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -589,6 +589,9 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
unsigned long expires)
{
+ unsigned long max_expire = jiffies + 8 * throtl_slice;
+ if (time_after(expires, max_expire))
+ expires = max_expire;
mod_timer(&sq->pending_timer, expires);
throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
expires - jiffies, jiffies);
--
2.9.3

2016-10-03 21:21:20

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 09/11] block-throttle: make bandwidth change smooth

When all cgroups reach their high limit, they are allowed to dispatch more IO.
This can let some cgroups dispatch more IO while others can't, and some
cgroups may even dispatch less IO than their high limit. For example, take cg1
with a 10MB/s high limit and cg2 with an 80MB/s high limit, and assume the
disk's maximum bandwidth for the workload is 120MB/s. Their bandwidth could
look like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their high limit, so they can dispatch more IO
afterwards. Then cg1 dispatches more IO and cg2 has no room left to dispatch
enough. At T2, cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches
less than its 80MB/s high limit, we downgrade the queue from LIMIT_MAX to
LIMIT_HIGH and all cgroups are throttled back to their high limit (T3). As a
result cg2 runs below its high limit most of the time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation. This patch
makes cgroup bandwidth change smoothly instead. After the disk upgrades from
LIMIT_HIGH to LIMIT_MAX, cgroups are not allowed to use all of the bandwidth
up to their max limit immediately; their effective limit is increased
gradually to avoid the situation above. So the example above becomes something
like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 10/80

This way the cgroups' bandwidth stays above their high limit most of the time.
It still doesn't fully utilize the disk bandwidth, but that's the price we pay
for sharing.

Note that this doesn't completely prevent a cgroup from running under its high
limit. The best way to guarantee that a cgroup doesn't run under its high
limit is to set a max limit; for example, if we set cg1's max limit to 40MB/s,
cg2 will never run under its high limit.
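
Restating the ramp implemented by tg_bps_limit()/tg_iops_limit() below (not
additional behavior):

  effective_limit = min(limit[LIMIT_MAX],
                        limit[LIMIT_HIGH] + (limit[LIMIT_HIGH] / 2) * scale)

where scale = (jiffies - high_upgrade_time) / throtl_slice, capped at 4096.
With cg1's 10MB/s high limit this gives 10, 15, 20, 25, ... MB/s on successive
slices after an upgrade, which is the progression shown above.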

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 44 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 759bce1..1658b13 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -170,6 +170,8 @@ struct throtl_data

unsigned long high_upgrade_time;
unsigned long high_downgrade_time;
+
+ unsigned int scale;
};

static void throtl_pending_timer_fn(unsigned long arg);
@@ -224,12 +226,28 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
uint64_t ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return -1;
- ret = tg->bps[rw][tg->td->limit_index];
- if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ td = tg->td;
+ ret = tg->bps[rw][td->limit_index];
+ if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_HIGH] != -1) {
+ uint64_t increase;
+
+ if (td->scale < 4096 && time_after_eq(jiffies,
+ td->high_upgrade_time + td->scale * td->throtl_slice)) {
+ uint64_t time = jiffies - td->high_upgrade_time;
+
+ do_div(time, td->throtl_slice);
+ td->scale = time;
+ }
+ increase = (tg->bps[rw][LIMIT_HIGH] >> 1) * td->scale;
+ ret = min(tg->bps[rw][LIMIT_MAX],
+ tg->bps[rw][LIMIT_HIGH] + increase);
+ }
+ if (ret == -1 && td->limit_index == LIMIT_HIGH)
return tg->bps[rw][LIMIT_MAX];

return ret;
@@ -238,12 +256,29 @@ static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
{
struct blkcg_gq *blkg = tg_to_blkg(tg);
+ struct throtl_data *td;
unsigned int ret;

if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return -1;
- ret = tg->iops[rw][tg->td->limit_index];
- if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+ td = tg->td;
+ ret = tg->iops[rw][td->limit_index];
+ if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_HIGH] != -1) {
+ uint64_t increase;
+
+ if (td->scale < 4096 && time_after_eq(jiffies,
+ td->high_upgrade_time + td->scale * td->throtl_slice)) {
+ uint64_t time = jiffies - td->high_upgrade_time;
+
+ do_div(time, td->throtl_slice);
+ td->scale = time;
+ }
+
+ increase = (tg->iops[rw][LIMIT_HIGH] >> 1) * td->scale;
+ ret = min(tg->iops[rw][LIMIT_MAX],
+ tg->iops[rw][LIMIT_HIGH] + (unsigned int)increase);
+ }
+ if (ret == -1 && td->limit_index == LIMIT_HIGH)
return tg->iops[rw][LIMIT_MAX];
return ret;
}
@@ -1671,6 +1706,7 @@ static void throtl_upgrade_state(struct throtl_data *td)

td->limit_index = LIMIT_MAX;
td->high_upgrade_time = jiffies;
+ td->scale = 0;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
--
2.9.3

2016-10-03 21:23:00

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 02/11] block-throttle: add .high interface

Add a high limit for cgroups and the corresponding cgroup interface.
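
As a hedged illustration, the new file could be driven from user space much
like the existing io.max file; the cgroup path and the 8:16 device numbers in
the sketch below are assumptions for illustration only:

/*
 * Hypothetical example: set an 80MB/s read high limit on one device for a
 * cgroup by writing "MAJ:MIN rbps=<bytes/s>" to io.high.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/test/io.high";	/* assumed cgroup */
	const char *cfg = "8:16 rbps=83886080\n";		/* assumed device */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open io.high");
		return 1;
	}
	if (write(fd, cfg, strlen(cfg)) < 0)
		perror("write io.high");
	close(fd);
	return 0;
}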

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 139 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 107 insertions(+), 32 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 964b713..59d4b4c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,8 +84,9 @@ enum tg_state_flags {
#define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node)

enum {
- LIMIT_MAX = 0,
- LIMIT_CNT = 1,
+ LIMIT_HIGH = 0,
+ LIMIT_MAX = 1,
+ LIMIT_CNT = 2,
};

struct throtl_grp {
@@ -352,7 +353,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)

RB_CLEAR_NODE(&tg->rb_node);
for (rw = READ; rw <= WRITE; rw++) {
- for (index = LIMIT_MAX; index < LIMIT_CNT; index++) {
+ for (index = LIMIT_HIGH; index < LIMIT_CNT; index++) {
tg->bps[rw][index] = -1;
tg->iops[rw][index] = -1;
}
@@ -414,6 +415,46 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
tg_update_has_rules(pd_to_tg(pd));
}

+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+ bool high_valid = false;
+
+ rcu_read_lock();
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_grp *tg = blkg_to_tg(blkg);
+
+ if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+ tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+ tg->iops[READ][LIMIT_HIGH] != -1 ||
+ tg->iops[WRITE][LIMIT_HIGH] != -1)
+ high_valid = true;
+ }
+ rcu_read_unlock();
+
+ if (high_valid)
+ td->limit_valid[LIMIT_HIGH] = true;
+ else
+ td->limit_valid[LIMIT_HIGH] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+ struct throtl_grp *tg = pd_to_tg(pd);
+
+ tg->bps[READ][LIMIT_HIGH] = -1;
+ tg->bps[WRITE][LIMIT_HIGH] = -1;
+ tg->iops[READ][LIMIT_HIGH] = -1;
+ tg->iops[WRITE][LIMIT_HIGH] = -1;
+
+ blk_throtl_update_valid_limit(tg->td);
+
+ if (tg->td->limit_index == LIMIT_HIGH &&
+ !tg->td->limit_valid[LIMIT_HIGH])
+ tg->td->limit_index = LIMIT_MAX;
+}
+
static void throtl_pd_free(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1283,7 +1324,7 @@ static struct cftype throtl_legacy_files[] = {
{ } /* terminate */
};

-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1292,36 +1333,32 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,

if (!dname)
return 0;
- if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
- tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
+ if (tg->bps[READ][off] == -1 && tg->bps[WRITE][off] == -1 &&
+ tg->iops[READ][off] == -1 && tg->iops[WRITE][off] == -1)
return 0;

- if (tg->bps[READ][LIMIT_MAX] != -1)
- snprintf(bufs[0], sizeof(bufs[0]), "%llu",
- tg->bps[READ][LIMIT_MAX]);
- if (tg->bps[WRITE][LIMIT_MAX] != -1)
- snprintf(bufs[1], sizeof(bufs[1]), "%llu",
- tg->bps[WRITE][LIMIT_MAX]);
- if (tg->iops[READ][LIMIT_MAX] != -1)
- snprintf(bufs[2], sizeof(bufs[2]), "%u",
- tg->iops[READ][LIMIT_MAX]);
- if (tg->iops[WRITE][LIMIT_MAX] != -1)
- snprintf(bufs[3], sizeof(bufs[3]), "%u",
- tg->iops[WRITE][LIMIT_MAX]);
+ if (tg->bps[READ][off] != -1)
+ snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ][off]);
+ if (tg->bps[WRITE][off] != -1)
+ snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE][off]);
+ if (tg->iops[READ][off] != -1)
+ snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ][off]);
+ if (tg->iops[WRITE][off] != -1)
+ snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE][off]);

seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
dname, bufs[0], bufs[1], bufs[2], bufs[3]);
return 0;
}

-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
{
- blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+ blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
&blkcg_policy_throtl, seq_cft(sf)->private, false);
return 0;
}

-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1329,6 +1366,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
struct throtl_grp *tg;
u64 v[4];
int ret;
+ int index = of_cft(of)->private;

ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
if (ret)
@@ -1336,10 +1374,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

tg = blkg_to_tg(ctx.blkg);

- v[0] = tg->bps[READ][LIMIT_MAX];
- v[1] = tg->bps[WRITE][LIMIT_MAX];
- v[2] = tg->iops[READ][LIMIT_MAX];
- v[3] = tg->iops[WRITE][LIMIT_MAX];
+ v[0] = tg->bps[READ][index];
+ v[1] = tg->bps[WRITE][index];
+ v[2] = tg->iops[READ][index];
+ v[3] = tg->iops[WRITE][index];

while (true) {
char tok[27]; /* wiops=18446744073709551616 */
@@ -1376,11 +1414,37 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
goto out_finish;
}

- tg->bps[READ][LIMIT_MAX] = v[0];
- tg->bps[WRITE][LIMIT_MAX] = v[1];
- tg->iops[READ][LIMIT_MAX] = v[2];
- tg->iops[WRITE][LIMIT_MAX] = v[3];
-
+ if (index == LIMIT_MAX) {
+ if ((v[0] < tg->bps[READ][LIMIT_HIGH] &&
+ tg->bps[READ][LIMIT_HIGH] != -1) ||
+ (v[1] < tg->bps[WRITE][LIMIT_HIGH] &&
+ tg->bps[WRITE][LIMIT_HIGH] != -1) ||
+ (v[2] < tg->iops[READ][LIMIT_HIGH] &&
+ tg->iops[READ][LIMIT_HIGH] != -1) ||
+ (v[3] < tg->iops[WRITE][LIMIT_HIGH] &&
+ tg->iops[WRITE][LIMIT_HIGH] != -1)) {
+ ret = -EINVAL;
+ goto out_finish;
+ }
+ } else if (index == LIMIT_HIGH) {
+ if ((v[0] > tg->bps[READ][LIMIT_MAX] && v[0] != -1) ||
+ (v[1] > tg->bps[WRITE][LIMIT_MAX] && v[1] != -1) ||
+ (v[2] > tg->iops[READ][LIMIT_MAX] && v[2] != -1) ||
+ (v[3] > tg->iops[WRITE][LIMIT_MAX] && v[3] != -1)) {
+ ret = -EINVAL;
+ goto out_finish;
+ }
+ }
+ tg->bps[READ][index] = v[0];
+ tg->bps[WRITE][index] = v[1];
+ tg->iops[READ][index] = v[2];
+ tg->iops[WRITE][index] = v[3];
+
+ if (index == LIMIT_HIGH) {
+ blk_throtl_update_valid_limit(tg->td);
+ if (tg->td->limit_valid[LIMIT_HIGH])
+ tg->td->limit_index = LIMIT_HIGH;
+ }
tg_conf_updated(tg);
ret = 0;
out_finish:
@@ -1390,10 +1454,18 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,

static struct cftype throtl_files[] = {
{
+ .name = "high",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_HIGH,
+ },
+ {
.name = "max",
.flags = CFTYPE_NOT_ON_ROOT,
- .seq_show = tg_print_max,
- .write = tg_set_max,
+ .seq_show = tg_print_limit,
+ .write = tg_set_limit,
+ .private = LIMIT_MAX,
},
{ } /* terminate */
};
@@ -1412,6 +1484,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_alloc_fn = throtl_pd_alloc,
.pd_init_fn = throtl_pd_init,
.pd_online_fn = throtl_pd_online,
+ .pd_offline_fn = throtl_pd_offline,
.pd_free_fn = throtl_pd_free,
};

@@ -1589,7 +1662,9 @@ int blk_throtl_init(struct request_queue *q)
q->td = td;
td->queue = q;

+ td->limit_valid[LIMIT_HIGH] = false;
td->limit_valid[LIMIT_MAX] = true;
+ td->limit_index = LIMIT_MAX;
/* activate policy */
ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
if (ret)
--
2.9.3

2016-10-03 21:23:13

by Shaohua Li

[permalink] [raw]
Subject: [PATCH v3 08/11] blk-throttle: detect completed idle cgroup

A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to the higher level and all cgroups will be
throttled to their lower limit, so we waste bandwidth. Detecting an idle
cgroup is hard. This patch handles a simple case: a cgroup that doesn't
dispatch any IO. We ignore such a cgroup's limit, so other cgroups can use
the bandwidth.
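
A rough user-space sketch of the test this patch adds (the names below are
invented for illustration): a group that has not dispatched IO in either
direction for at least one throttle slice is treated as idle and its limit
is ignored.

#include <stdbool.h>
#include <stdio.h>

struct grp_sample {
	unsigned long last_dispatch_time[2];	/* [0]=READ, [1]=WRITE, jiffies */
};

/* wrap-safe "now >= stamp + slice" check, for both directions */
static bool grp_is_idle(const struct grp_sample *g, unsigned long now,
			unsigned long throtl_slice)
{
	return (long)(now - (g->last_dispatch_time[0] + throtl_slice)) >= 0 &&
	       (long)(now - (g->last_dispatch_time[1] + throtl_slice)) >= 0;
}

int main(void)
{
	struct grp_sample g = { .last_dispatch_time = { 100, 100 } };

	/* with a 20-jiffy slice the group counts as idle from jiffy 120 on */
	printf("now=115 idle=%d\n", grp_is_idle(&g, 115, 20));
	printf("now=125 idle=%d\n", grp_is_idle(&g, 125, 20));
	return 0;
}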

Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4263f0c..759bce1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -144,6 +144,8 @@ struct throtl_grp {

unsigned long last_check_time;

+ unsigned long last_dispatch_time[2];
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -438,11 +440,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)

static void throtl_pd_online(struct blkg_policy_data *pd)
{
+ struct throtl_grp *tg = pd_to_tg(pd);
/*
* We don't want new groups to escape the limits of its ancestors.
* Update has_rules[] after a new group is brought online.
*/
- tg_update_has_rules(pd_to_tg(pd));
+ tg_update_has_rules(tg);
+ tg->last_dispatch_time[READ] = jiffies;
+ tg->last_dispatch_time[WRITE] = jiffies;
}

static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1606,6 +1611,12 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg)
if (write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ]))
return true;
+
+ if (time_after_eq(jiffies,
+ tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+ time_after_eq(jiffies,
+ tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+ return true;
return false;
}

@@ -1686,6 +1697,9 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;

+ if (time_after_eq(now, tg->last_dispatch_time[READ] + td->throtl_slice) &&
+ time_after_eq(now, tg->last_dispatch_time[WRITE] + td->throtl_slice))
+ return false;
/*
* If cgroup is below high limit, consider downgrade and throttle other
* cgroups
@@ -1804,6 +1818,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

again:
while (true) {
+ tg->last_dispatch_time[rw] = jiffies;
if (tg->last_high_overflow_time[rw] == 0)
tg->last_high_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
--
2.9.3

2016-10-04 13:28:08

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> Hi,
>
> The background is we don't have an ioscheduler for blk-mq yet, so we can't
> prioritize processes/cgroups.

So this is an interim solution till we have ioscheduler for blk-mq?

> This patch set tries to add basic arbitration
> between cgroups with blk-throttle. It adds a new limit io.high for
> blk-throttle. It's only for cgroup2.
>
> io.max is a hard limit throttling. cgroups with a max limit never dispatch more
> IO than their max limit. While io.high is a best effort throttling. cgroups
> with high limit can run above their high limit at appropriate time.
> Specifically, if all cgroups reach their high limit, all cgroups can run above
> their high limit. If any cgroup runs under its high limit, all other cgroups
> will run according to their high limit.

Hi Shaohua,

I still don't understand why we should not implement a weight-based
proportional IO mechanism, and how this mechanism is better than proportional IO.

Agreed that we have issues with proportional IO and we don't have good
solutions for these problems. But I can't see how this mechanism
will overcome these problems either.

IIRC, the biggest issue with proportional IO was that a low prio group might
fill up the device queue with plenty of IO requests, and later when a high
prio cgroup comes, it will still experience latencies anyway. And the solution
to the problem would probably be to get some awareness in the device about
the priority of a request and map weights to those priorities. That way higher
prio requests get prioritized.

Or run the device at a lower queue depth. That will improve latencies but might
reduce overall throughput.

Or throttle the number of buffered writes (as Jens's writeback throttling
patches were doing). Buffered writes seem to be the biggest culprit for
increased latencies and being able to control these should help.

An ioprio/weight-based proportional IO mechanism is much more generic and
much easier to configure for any kind of storage. io.high is an absolute
limit and makes it much harder to configure. One needs to know a lot
about the underlying volume/device's bandwidth (which varies a lot anyway
based on workload).

IMHO, we seem to be trying to cater to one specific use case using
this mechanism. Something ioprio/weight based will be much more
generic and we should explore implementing that along with building
notion of ioprio in devices. When these two work together, we might
be able to see good results. Just software mechanism alone might not
be enough.

Vivek

2016-10-04 15:56:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Vivek.

On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> > Hi,
> >
> > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > prioritize processes/cgroups.
>
> So this is an interim solution till we have ioscheduler for blk-mq?

It's a common permanent solution which applies to both !mq and mq.

> > This patch set tries to add basic arbitration
> > between cgroups with blk-throttle. It adds a new limit io.high for
> > blk-throttle. It's only for cgroup2.
> >
> > io.max is a hard limit throttling. cgroups with a max limit never dispatch more
> > IO than their max limit. While io.high is a best effort throttling. cgroups
> > with high limit can run above their high limit at appropriate time.
> > Specifically, if all cgroups reach their high limit, all cgroups can run above
> > their high limit. If any cgroup runs under its high limit, all other cgroups
> > will run according to their high limit.
>
> Hi Shaohua,
>
> I still don't understand why we should not implement a weight based
> proportional IO mechanism and how this mechanism is better than proportional IO .

Oh, if we actually can implement proportional IO control, it'd be
great. The problem is that we have no way of knowing IO cost for
highspeed ssd devices. CFQ gets around the problem by using the
walltime as the measure of resource usage and scheduling time slices,
which works fine for rotating disks but horribly for highspeed ssds.

We can get some semblance of proportional control by just counting bw
or iops but both break down badly as a means to measure the actual
resource consumption depending on the workload. While limit based
control is more tedious to configure, it doesn't misrepresent what's
going on and is a lot less likely to produce surprising outcomes.

We *can* try to concoct something which tries to do proportional
control for highspeed ssds but that's gonna be quite a bit of
complexity and I'm not so sure it'd be justifiable given that we can't
even figure out measurement of the most basic operating unit.

> Agreed that we have issues with proportional IO and we don't have good
> solutions for these problems. But I can't see that how this mechanism
> will overcome these problems either.

It mostly defers the burden to the one who's configuring the limits
and expects it to know the characteristics of the device and workloads
and configure accordingly. It's quite a bit more tedious to use but
should be able to cover good portion of use cases without being overly
complicated. I agree that it'd be nice to have a simple proportional
control but as you said can't see a good solution for it at the
moment.

> IIRC, biggest issue with proportional IO was that a low prio group might
> fill up the device queue with plenty of IO requests and later when high
> prio cgroup comes, it will still experience latencies anyway. And solution
> to the problem probably would be to get some awareness in device about
> priority of request and map weights to those priority. That way higher
> prio requests get prioritized.

Nah, the real problem is that we can't even decide what the
proportions should be based on. The most fundamental part is missing.

> Or run device at lower queue depth. That will improve latencies but migth
> reduce overall throughput.

And that we can't do this (and thus basically operate close to
scheduling time slices) for highspeed ssds.

> Or thorottle number of buffered writes (as Jens's writeback throttling)
> patches were doing. Buffered writes seem to be biggest culprit for
> increased latencies and being able to control these should help.

That's a different topic.

> ioprio/weight based proportional IO mechanism is much more generic and
> much easier to configure for any kind of storage. io.high is absolute
> limit and makes it much harder to configure. One needs to know a lot
> about underlying volume/device's bandwidth (which varies a lot anyway
> based on workload).

Yeap, no disagreement there, but it still is a workable solution.

> IMHO, we seem to be trying to cater to one specific use case using
> this mechanism. Something ioprio/weight based will be much more
> generic and we should explore implementing that along with building
> notion of ioprio in devices. When these two work together, we might
> be able to see good results. Just software mechanism alone might not
> be enough.

I don't think it's catering to specific use cases. It is a generic
mechanism which demands knowledge and experimentation to configure.
It's more a way for the kernel to cop out and defer figuring out
device characteristics to userland. If you have a better idea, I'm
all ears.

Thanks.

--
tejun

2016-10-04 16:22:39

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 17:56, Tejun Heo <[email protected]> ha scritto:
>
> Hello, Vivek.
>
> On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
>> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
>>> Hi,
>>>
>>> The background is we don't have an ioscheduler for blk-mq yet, so we can't
>>> prioritize processes/cgroups.
>>
>> So this is an interim solution till we have ioscheduler for blk-mq?
>
> It's a common permanent solution which applies to both !mq and mq.
>
>>> This patch set tries to add basic arbitration
>>> between cgroups with blk-throttle. It adds a new limit io.high for
>>> blk-throttle. It's only for cgroup2.
>>>
>>> io.max is a hard limit throttling. cgroups with a max limit never dispatch more
>>> IO than their max limit. While io.high is a best effort throttling. cgroups
>>> with high limit can run above their high limit at appropriate time.
>>> Specifically, if all cgroups reach their high limit, all cgroups can run above
>>> their high limit. If any cgroup runs under its high limit, all other cgroups
>>> will run according to their high limit.
>>
>> Hi Shaohua,
>>
>> I still don't understand why we should not implement a weight based
>> proportional IO mechanism and how this mechanism is better than proportional IO .
>
> Oh, if we actually can implement proportional IO control, it'd be
> great. The problem is that we have no way of knowing IO cost for
> highspeed ssd devices. CFQ gets around the problem by using the
> walltime as the measure of resource usage and scheduling time slices,
> which works fine for rotating disks but horribly for highspeed ssds.
>

Could you please elaborate more on this point? BFQ uses sectors
served to measure service, and, on all the fast devices on which
we have tested it, it accurately distributes
bandwidth as desired, redistributes excess bandwidth without any issue,
and guarantees high responsiveness and low latency at application and
system level (e.g., ~0 drop rate in video playback, with any background
workload tested).

Could you please suggest a test to show how sector-based
guarantees fail?

Thanks,
Paolo

> We can get some semblance of proportional control by just counting bw
> or iops but both break down badly as a means to measure the actual
> resource consumption depending on the workload. While limit based
> control is more tedious to configure, it doesn't misrepresent what's
> going on and is a lot less likely to produce surprising outcomes.
>
> We *can* try to concoct something which tries to do proportional
> control for highspeed ssds but that's gonna be quite a bit of
> complexity and I'm not so sure it'd be justifiable given that we can't
> even figure out measurement of the most basic operating unit.
>
>> Agreed that we have issues with proportional IO and we don't have good
>> solutions for these problems. But I can't see that how this mechanism
>> will overcome these problems either.
>
> It mostly defers the burden to the one who's configuring the limits
> and expects it to know the characteristics of the device and workloads
> and configure accordingly. It's quite a bit more tedious to use but
> should be able to cover good portion of use cases without being overly
> complicated. I agree that it'd be nice to have a simple proportional
> control but as you said can't see a good solution for it at the
> moment.
>
>> IIRC, biggest issue with proportional IO was that a low prio group might
>> fill up the device queue with plenty of IO requests and later when high
>> prio cgroup comes, it will still experience latencies anyway. And solution
>> to the problem probably would be to get some awareness in device about
>> priority of request and map weights to those priority. That way higher
>> prio requests get prioritized.
>
> Nah, the real problem is that we can't even decide what the
> proportions should be based on. The most fundamental part is missing.
>
>> Or run device at lower queue depth. That will improve latencies but migth
>> reduce overall throughput.
>
> And that we can't do this (and thus basically operate close to
> scheduling time slices) for highspeed ssds.
>
>> Or thorottle number of buffered writes (as Jens's writeback throttling)
>> patches were doing. Buffered writes seem to be biggest culprit for
>> increased latencies and being able to control these should help.
>
> That's a different topic.
>
>> ioprio/weight based proportional IO mechanism is much more generic and
>> much easier to configure for any kind of storage. io.high is absolute
>> limit and makes it much harder to configure. One needs to know a lot
>> about underlying volume/device's bandwidth (which varies a lot anyway
>> based on workload).
>
> Yeap, no disagreement there, but it still is a workable solution.
>
>> IMHO, we seem to be trying to cater to one specific use case using
>> this mechanism. Something ioprio/weight based will be much more
>> generic and we should explore implementing that along with building
>> notion of ioprio in devices. When these two work together, we might
>> be able to see good results. Just software mechanism alone might not
>> be enough.
>
> I don't think it's catering to specific use cases. It is a generic
> mechanism which demands knowledge and experimentation to configure.
> It's more a way for the kernel to cop out and defer figuring out
> device characteristics to userland. If you have a better idea, I'm
> all ears.
>
> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-04 16:28:04

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello,

On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
> Could you please elaborate more on this point? BFQ uses sectors
> served to measure service, and, on the all the fast devices on which
> we have tested it, it accurately distributes
> bandwidth as desired, redistributes excess bandwidth with any issue,
> and guarantees high responsiveness and low latency at application and
> system level (e.g., ~0 drop rate in video playback, with any background
> workload tested).

The same argument as before. Bandwidth is a very bad measure of IO
resources spent. For specific use cases (like desktop or whatever),
this can work but not generally.

> Could you please suggest me some test to show how sector-based
> guarantees fails?

Well, mix 4k random and sequential workloads and try to distribute the
actual IO resources.

Thanks.

--
tejun

2016-10-04 17:01:53

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo <[email protected]> ha scritto:
>
> Hello,
>
> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
>> Could you please elaborate more on this point? BFQ uses sectors
>> served to measure service, and, on the all the fast devices on which
>> we have tested it, it accurately distributes
>> bandwidth as desired, redistributes excess bandwidth with any issue,
>> and guarantees high responsiveness and low latency at application and
>> system level (e.g., ~0 drop rate in video playback, with any background
>> workload tested).
>
> The same argument as before. Bandwidth is a very bad measure of IO
> resources spent. For specific use cases (like desktop or whatever),
> this can work but not generally.
>

Actually, we have already discussed this point, and IMHO the arguments
that (apparently) convinced you that bandwidth is the most relevant
service guarantee for I/O in desktops and the like, prove that
bandwidth is the most important service guarantee in servers too.

Again, all the examples I can think of seem to confirm it:
. file hosting: a good service must guarantee reasonable read/write,
i.e., download/upload, speeds to users
. file streaming: a good service must guarantee low drop rates, and
this can be guaranteed only by guaranteeing bandwidth and latency
. web hosting: high bandwidth and low latency needed here too
. clouds: high bw and low latency needed to let, e.g., users of VMs
enjoy high responsiveness and, for example, reasonable file-copy
time
...

To put it yet another way, with packet I/O in, e.g., clouds, there are
basically the same issues, and the main goal is again guaranteeing
bandwidth and low latency among nodes.

Could you please provide a concrete server example (assuming we still
agree about desktops), where I/O bandwidth does not matter while time
does?


>> Could you please suggest me some test to show how sector-based
>> guarantees fails?
>
> Well, mix 4k random and sequential workloads and try to distribute the
> acteual IO resources.
>


If I'm not mistaken, we have already gone through this example too,
and I thought we agreed on what service scheme worked best, again
focusing only on desktops. To make a long story short(er), here is a
snippet from one of our last exchanges.

----------

On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> Maybe the source of confusion is the fact that a simple sector-based,
> proportional share scheduler always distributes total bandwidth
> according to weights. The catch is the additional BFQ rule: random
> workloads get only time isolation, and are charged for full budgets,
> so as to not affect the schedule of quasi-sequential workloads. So,
> the correct claim for BFQ is that it distributes total bandwidth
> according to weights (only) when all competing workloads are
> quasi-sequential. If some workloads are random, then these workloads
> are just time scheduled. This does break proportional-share bandwidth
> distribution with mixed workloads, but, much more importantly, saves
> both total throughput and individual bandwidths of quasi-sequential
> workloads.
>
> We could then check whether I did succeed in tuning timeouts and
> budgets so as to achieve the best tradeoffs. But this is probably a
> second-order problem as of now.

[you] Ah, I see. Yeah, that clears it up for me. I'm gonna play with
cgroup settings and see how it actually behaves.

---------

Why does the above argument not work for a server too? What am I
missing?

Thanks,
Paolo

> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-04 17:08:42

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hi,
On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> > Hi,
> >
> > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > prioritize processes/cgroups.
>
> So this is an interim solution till we have ioscheduler for blk-mq?

This is still a generic solution to prioritize workloads.

> > This patch set tries to add basic arbitration
> > between cgroups with blk-throttle. It adds a new limit io.high for
> > blk-throttle. It's only for cgroup2.
> >
> > io.max is a hard limit throttling. cgroups with a max limit never dispatch more
> > IO than their max limit. While io.high is a best effort throttling. cgroups
> > with high limit can run above their high limit at appropriate time.
> > Specifically, if all cgroups reach their high limit, all cgroups can run above
> > their high limit. If any cgroup runs under its high limit, all other cgroups
> > will run according to their high limit.
>
> Hi Shaohua,
>
> I still don't understand why we should not implement a weight based
> proportional IO mechanism and how this mechanism is better than proportional IO .
>
> Agreed that we have issues with proportional IO and we don't have good
> solutions for these problems. But I can't see that how this mechanism
> will overcome these problems either.

No, I never claimed this mechanism is better than proportional IO. The problem
with proportional IO is that we don't have a mechanism to measure IO cost, which is
the core of proportional control. This mechanism only prioritizes IO. It's not as
useful as proportional control, but it works for a lot of scenarios.

>
> IIRC, biggest issue with proportional IO was that a low prio group might
> fill up the device queue with plenty of IO requests and later when high
> prio cgroup comes, it will still experience latencies anyway. And solution
> to the problem probably would be to get some awareness in device about
> priority of request and map weights to those priority. That way higher
> prio requests get prioritized.
>
> Or run device at lower queue depth. That will improve latencies but migth
> reduce overall throughput.

Yep, this is the hardest part. It really depends on the tradeoff between
throughput and latency. Running the device at a low queue depth sounds workable, but
the sacrifice is extremely high for modern SSDs. Small-size IO throughput ranges
from several MB/s to several GB/s depending on queue depth. If we run the device
at a lower queue depth, the sacrifice is big enough to make device sharing
pointless.

> Or thorottle number of buffered writes (as Jens's writeback throttling)
> patches were doing. Buffered writes seem to be biggest culprit for
> increased latencies and being able to control these should help.

Big reads can significantly increase latency too. Please note latency isn't
the only factor applications care about. Non-interactive workloads don't care
about single-IO latency; throughput or amortized latency is more important for
such workloads.

> ioprio/weight based proportional IO mechanism is much more generic and
> much easier to configure for any kind of storage. io.high is absolute
> limit and makes it much harder to configure. One needs to know a lot
> about underlying volume/device's bandwidth (which varies a lot anyway
> based on workload).
> IMHO, we seem to be trying to cater to one specific use case using
> this mechanism. Something ioprio/weight based will be much more
> generic and we should explore implementing that along with building
> notion of ioprio in devices. When these two work together, we might
> be able to see good results. Just software mechanism alone might not
> be enough.

Agreed, a proportional IO mechanism is easier to configure. The problem is we can't
build it without hardware support.

Thanks,
Shaohua

2016-10-04 17:29:23

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
>
> > Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo <[email protected]> ha scritto:
> >
> > Hello,
> >
> > On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
> >> Could you please elaborate more on this point? BFQ uses sectors
> >> served to measure service, and, on the all the fast devices on which
> >> we have tested it, it accurately distributes
> >> bandwidth as desired, redistributes excess bandwidth with any issue,
> >> and guarantees high responsiveness and low latency at application and
> >> system level (e.g., ~0 drop rate in video playback, with any background
> >> workload tested).
> >
> > The same argument as before. Bandwidth is a very bad measure of IO
> > resources spent. For specific use cases (like desktop or whatever),
> > this can work but not generally.
> >
>
> Actually, we have already discussed this point, and IMHO the arguments
> that (apparently) convinced you that bandwidth is the most relevant
> service guarantee for I/O in desktops and the like, prove that
> bandwidth is the most important service guarantee in servers too.
>
> Again, all the examples I can think of seem to confirm it:
> . file hosting: a good service must guarantee reasonable read/write,
> i.e., download/upload, speeds to users
> . file streaming: a good service must guarantee low drop rates, and
> this can be guaranteed only by guaranteeing bandwidth and latency
> . web hosting: high bandwidth and low latency needed here too
> . clouds: high bw and low latency needed to let, e.g., users of VMs
> enjoy high responsiveness and, for example, reasonable file-copy
> time
> ...
>
> To put in yet another way, with packet I/O in, e.g., clouds, there are
> basically the same issues, and the main goal is again guaranteeing
> bandwidth and low latency among nodes.
>
> Could you please provide a concrete server example (assuming we still
> agree about desktops), where I/O bandwidth does not matter while time
> does?

I'm not saying IO bandwidth doesn't matter. The problem is bandwidth can't
measure IO cost. For example, you can't say an 8k IO costs 2x the IO resources of
a 4k IO.

> >> Could you please suggest me some test to show how sector-based
> >> guarantees fails?
> >
> > Well, mix 4k random and sequential workloads and try to distribute the
> > acteual IO resources.
> >
>
>
> If I'm not mistaken, we have already gone through this example too,
> and I thought we agreed on what service scheme worked best, again
> focusing only on desktops. To make a long story short(er), here is a
> snippet from one of our last exchanges.
>
> ----------
>
> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> > Maybe the source of confusion is the fact that a simple sector-based,
> > proportional share scheduler always distributes total bandwidth
> > according to weights. The catch is the additional BFQ rule: random
> > workloads get only time isolation, and are charged for full budgets,
> > so as to not affect the schedule of quasi-sequential workloads. So,
> > the correct claim for BFQ is that it distributes total bandwidth
> > according to weights (only) when all competing workloads are
> > quasi-sequential. If some workloads are random, then these workloads
> > are just time scheduled. This does break proportional-share bandwidth
> > distribution with mixed workloads, but, much more importantly, saves
> > both total throughput and individual bandwidths of quasi-sequential
> > workloads.
> >
> > We could then check whether I did succeed in tuning timeouts and
> > budgets so as to achieve the best tradeoffs. But this is probably a
> > second-order problem as of now.

I don't see why random/sequential matters for SSDs. What really matters is
request size and IO depth. Time scheduling is questionable too, as workloads can
dispatch all their IO in almost zero time on high queue depth disks.

Thanks,
Shaohua

2016-10-04 17:44:02

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 19:28, Shaohua Li <[email protected]> ha scritto:
>
> On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
>>
>>> Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo <[email protected]> ha scritto:
>>>
>>> Hello,
>>>
>>> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
>>>> Could you please elaborate more on this point? BFQ uses sectors
>>>> served to measure service, and, on the all the fast devices on which
>>>> we have tested it, it accurately distributes
>>>> bandwidth as desired, redistributes excess bandwidth with any issue,
>>>> and guarantees high responsiveness and low latency at application and
>>>> system level (e.g., ~0 drop rate in video playback, with any background
>>>> workload tested).
>>>
>>> The same argument as before. Bandwidth is a very bad measure of IO
>>> resources spent. For specific use cases (like desktop or whatever),
>>> this can work but not generally.
>>>
>>
>> Actually, we have already discussed this point, and IMHO the arguments
>> that (apparently) convinced you that bandwidth is the most relevant
>> service guarantee for I/O in desktops and the like, prove that
>> bandwidth is the most important service guarantee in servers too.
>>
>> Again, all the examples I can think of seem to confirm it:
>> . file hosting: a good service must guarantee reasonable read/write,
>> i.e., download/upload, speeds to users
>> . file streaming: a good service must guarantee low drop rates, and
>> this can be guaranteed only by guaranteeing bandwidth and latency
>> . web hosting: high bandwidth and low latency needed here too
>> . clouds: high bw and low latency needed to let, e.g., users of VMs
>> enjoy high responsiveness and, for example, reasonable file-copy
>> time
>> ...
>>
>> To put in yet another way, with packet I/O in, e.g., clouds, there are
>> basically the same issues, and the main goal is again guaranteeing
>> bandwidth and low latency among nodes.
>>
>> Could you please provide a concrete server example (assuming we still
>> agree about desktops), where I/O bandwidth does not matter while time
>> does?
>
> I don't think IO bandwidth does not matter. The problem is bandwidth can't
> measure IO cost. For example, you can't say 8k IO costs 2x IO resource than 4k
> IO.
>

For what goal do you need to be able to say this, once you succeeded
in guaranteeing bandwidth and low latency to each
process/client/group/node/user?

>>>> Could you please suggest me some test to show how sector-based
>>>> guarantees fails?
>>>
>>> Well, mix 4k random and sequential workloads and try to distribute the
>>> acteual IO resources.
>>>
>>
>>
>> If I'm not mistaken, we have already gone through this example too,
>> and I thought we agreed on what service scheme worked best, again
>> focusing only on desktops. To make a long story short(er), here is a
>> snippet from one of our last exchanges.
>>
>> ----------
>>
>> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
>>> Maybe the source of confusion is the fact that a simple sector-based,
>>> proportional share scheduler always distributes total bandwidth
>>> according to weights. The catch is the additional BFQ rule: random
>>> workloads get only time isolation, and are charged for full budgets,
>>> so as to not affect the schedule of quasi-sequential workloads. So,
>>> the correct claim for BFQ is that it distributes total bandwidth
>>> according to weights (only) when all competing workloads are
>>> quasi-sequential. If some workloads are random, then these workloads
>>> are just time scheduled. This does break proportional-share bandwidth
>>> distribution with mixed workloads, but, much more importantly, saves
>>> both total throughput and individual bandwidths of quasi-sequential
>>> workloads.
>>>
>>> We could then check whether I did succeed in tuning timeouts and
>>> budgets so as to achieve the best tradeoffs. But this is probably a
>>> second-order problem as of now.
>
> I don't see why random/sequential matters for SSD. what really matters is
> request size and IO depth. Time scheduling is skeptical too, as workloads can
> dispatch all IO within almost 0 time in high queue depth disks.
>

That's an orthogonal issue. If what matters is, e.g., size, then it is
enough to replace "sequential I/O" with "large-request I/O". In case
I have been too vague, here is an example: I mean that, e.g., in an I/O
scheduler you replace the function that computes whether a queue is
seeky based on request distance with a function based on
request size. And this is exactly what has already been done, for
example, in CFQ:

if (blk_queue_nonrot(cfqd->queue))
	cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
else
	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);

Thanks,
Paolo

> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-04 18:12:49

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Tue, Oct 04, 2016 at 11:56:16AM -0400, Tejun Heo wrote:
> Hello, Vivek.
>
> On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
> > On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> > > Hi,
> > >
> > > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > > prioritize processes/cgroups.
> >
> > So this is an interim solution till we have ioscheduler for blk-mq?
>
> It's a common permanent solution which applies to both !mq and mq.
>
> > > This patch set tries to add basic arbitration
> > > between cgroups with blk-throttle. It adds a new limit io.high for
> > > blk-throttle. It's only for cgroup2.
> > >
> > > io.max is a hard limit throttling. cgroups with a max limit never dispatch more
> > > IO than their max limit. While io.high is a best effort throttling. cgroups
> > > with high limit can run above their high limit at appropriate time.
> > > Specifically, if all cgroups reach their high limit, all cgroups can run above
> > > their high limit. If any cgroup runs under its high limit, all other cgroups
> > > will run according to their high limit.
> >
> > Hi Shaohua,
> >
> > I still don't understand why we should not implement a weight based
> > proportional IO mechanism and how this mechanism is better than proportional IO .
>
> Oh, if we actually can implement proportional IO control, it'd be
> great. The problem is that we have no way of knowing IO cost for
> highspeed ssd devices. CFQ gets around the problem by using the
> walltime as the measure of resource usage and scheduling time slices,
> which works fine for rotating disks but horribly for highspeed ssds.
>
> We can get some semblance of proportional control by just counting bw
> or iops but both break down badly as a means to measure the actual
> resource consumption depending on the workload. While limit based
> control is more tedious to configure, it doesn't misrepresent what's
> going on and is a lot less likely to produce surprising outcomes.
>
> We *can* try to concoct something which tries to do proportional
> control for highspeed ssds but that's gonna be quite a bit of
> complexity and I'm not so sure it'd be justifiable given that we can't
> even figure out measurement of the most basic operating unit.

Hi Tejun,

Agreed that we don't have a good basic unit to measure IO cost. I was
thinking of measuring cost in terms of sectors, as that's simple and
gets more accurate on faster devices with almost no seek penalty. And
in fact this proposal is also providing fairness in terms of bandwidth.
One extra feature seems to be this notion of a minimum bandwidth for each
cgroup, and until all competing groups have met their minimum,
other cgroups can't cross their limits.

(BTW, should we call io.high io.minimum instead? That is, this is the
minimum bandwidth a group should get before others get to cross their
minimum limit, up to the max limit.)

>
> > Agreed that we have issues with proportional IO and we don't have good
> > solutions for these problems. But I can't see that how this mechanism
> > will overcome these problems either.
>
> It mostly defers the burden to the one who's configuring the limits
> and expects it to know the characteristics of the device and workloads
> and configure accordingly. It's quite a bit more tedious to use but
> should be able to cover good portion of use cases without being overly
> complicated. I agree that it'd be nice to have a simple proportional
> control but as you said can't see a good solution for it at the
> moment.

Ok, so the idea is that if we can't provide something accurate in the kernel,
we expose a very low-level knob, which is harder to configure but
should work in some cases where users know their devices and workload
very well.

>
> > IIRC, biggest issue with proportional IO was that a low prio group might
> > fill up the device queue with plenty of IO requests and later when high
> > prio cgroup comes, it will still experience latencies anyway. And solution
> > to the problem probably would be to get some awareness in device about
> > priority of request and map weights to those priority. That way higher
> > prio requests get prioritized.
>
> Nah, the real problem is that we can't even decide what the
> proportions should be based on. The most fundamental part is missing.
>
> > Or run device at lower queue depth. That will improve latencies but migth
> > reduce overall throughput.
>
> And that we can't do this (and thus basically operate close to
> scheduling time slices) for highspeed ssds.
>
> > Or thorottle number of buffered writes (as Jens's writeback throttling)
> > patches were doing. Buffered writes seem to be biggest culprit for
> > increased latencies and being able to control these should help.
>
> That's a different topic.
>
> > ioprio/weight based proportional IO mechanism is much more generic and
> > much easier to configure for any kind of storage. io.high is absolute
> > limit and makes it much harder to configure. One needs to know a lot
> > about underlying volume/device's bandwidth (which varies a lot anyway
> > based on workload).
>
> Yeap, no disagreement there, but it still is a workable solution.
>
> > IMHO, we seem to be trying to cater to one specific use case using
> > this mechanism. Something ioprio/weight based will be much more
> > generic and we should explore implementing that along with building
> > notion of ioprio in devices. When these two work together, we might
> > be able to see good results. Just software mechanism alone might not
> > be enough.
>
> I don't think it's catering to specific use cases. It is a generic
> mechanism which demands knowledge and experimentation to configure.
> It's more a way for the kernel to cop out and defer figuring out
> device characteristics to userland. If you have a better idea, I'm
> all ears.

I don't think I have a better idea as such. We talked once and you
mentioned that for faster devices we should probably do some token-based
mechanism (which I believe would probably mean sector-based IO
accounting).

If a proportional IO controller based on the sector as the unit of measurement
is not good enough and does not solve the issues real-world workloads are
facing, then we can think of giving additional control in
blk-throttle to at least get some of the use cases working.

Thanks
Vivek

2016-10-04 18:28:47

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
>
> > Il giorno 04 ott 2016, alle ore 19:28, Shaohua Li <[email protected]> ha scritto:
> >
> > On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
> >>
> >>> Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo <[email protected]> ha scritto:
> >>>
> >>> Hello,
> >>>
> >>> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
> >>>> Could you please elaborate more on this point? BFQ uses sectors
> >>>> served to measure service, and, on the all the fast devices on which
> >>>> we have tested it, it accurately distributes
> >>>> bandwidth as desired, redistributes excess bandwidth with any issue,
> >>>> and guarantees high responsiveness and low latency at application and
> >>>> system level (e.g., ~0 drop rate in video playback, with any background
> >>>> workload tested).
> >>>
> >>> The same argument as before. Bandwidth is a very bad measure of IO
> >>> resources spent. For specific use cases (like desktop or whatever),
> >>> this can work but not generally.
> >>>
> >>
> >> Actually, we have already discussed this point, and IMHO the arguments
> >> that (apparently) convinced you that bandwidth is the most relevant
> >> service guarantee for I/O in desktops and the like, prove that
> >> bandwidth is the most important service guarantee in servers too.
> >>
> >> Again, all the examples I can think of seem to confirm it:
> >> . file hosting: a good service must guarantee reasonable read/write,
> >> i.e., download/upload, speeds to users
> >> . file streaming: a good service must guarantee low drop rates, and
> >> this can be guaranteed only by guaranteeing bandwidth and latency
> >> . web hosting: high bandwidth and low latency needed here too
> >> . clouds: high bw and low latency needed to let, e.g., users of VMs
> >> enjoy high responsiveness and, for example, reasonable file-copy
> >> time
> >> ...
> >>
> >> To put in yet another way, with packet I/O in, e.g., clouds, there are
> >> basically the same issues, and the main goal is again guaranteeing
> >> bandwidth and low latency among nodes.
> >>
> >> Could you please provide a concrete server example (assuming we still
> >> agree about desktops), where I/O bandwidth does not matter while time
> >> does?
> >
> > I don't think IO bandwidth does not matter. The problem is bandwidth can't
> > measure IO cost. For example, you can't say 8k IO costs 2x IO resource than 4k
> > IO.
> >
>
> For what goal do you need to be able to say this, once you succeeded
> in guaranteeing bandwidth and low latency to each
> process/client/group/node/user?

I think we are discussing whether bandwidth should be used to measure IO for
proportional IO scheduling. Since bandwidth can't measure the cost and you are
using it to do arbitration, you will either have low latency but unfair
bandwidth, or fair bandwidth but unexpectedly high latency for some workloads.
But it might be ok depending on the latency target (for example, you can set
the latency target high, so low latency is guaranteed) and workload
characteristics. I think bandwidth-based proportional scheduling will only
work for workloads where the disk isn't fully utilized.

> >>>> Could you please suggest me some test to show how sector-based
> >>>> guarantees fails?
> >>>
> >>> Well, mix 4k random and sequential workloads and try to distribute the
> >>> acteual IO resources.
> >>>
> >>
> >>
> >> If I'm not mistaken, we have already gone through this example too,
> >> and I thought we agreed on what service scheme worked best, again
> >> focusing only on desktops. To make a long story short(er), here is a
> >> snippet from one of our last exchanges.
> >>
> >> ----------
> >>
> >> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> >>> Maybe the source of confusion is the fact that a simple sector-based,
> >>> proportional share scheduler always distributes total bandwidth
> >>> according to weights. The catch is the additional BFQ rule: random
> >>> workloads get only time isolation, and are charged for full budgets,
> >>> so as to not affect the schedule of quasi-sequential workloads. So,
> >>> the correct claim for BFQ is that it distributes total bandwidth
> >>> according to weights (only) when all competing workloads are
> >>> quasi-sequential. If some workloads are random, then these workloads
> >>> are just time scheduled. This does break proportional-share bandwidth
> >>> distribution with mixed workloads, but, much more importantly, saves
> >>> both total throughput and individual bandwidths of quasi-sequential
> >>> workloads.
> >>>
> >>> We could then check whether I did succeed in tuning timeouts and
> >>> budgets so as to achieve the best tradeoffs. But this is probably a
> >>> second-order problem as of now.
> >
> > I don't see why random/sequential matters for SSD. what really matters is
> > request size and IO depth. Time scheduling is skeptical too, as workloads can
> > dispatch all IO within almost 0 time in high queue depth disks.
> >
>
> That's an orthogonal issue. If what matter is, e.g., size, then it is
> enough to replace "sequential I/O" with "large-request I/O". In case
> I have been too vague, here is an example: I mean that, e.g, in an I/O
> scheduler you replace the function that computes whether a queue is
> seeky based on request distance, with a function based on
> request size. And this is exactly what has been already done, for
> example, in CFQ:
>
> if (blk_queue_nonrot(cfqd->queue))
> cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
> else
> cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);

CFQ is known to be unfair for SSDs, especially high queue depth SSDs, so this doesn't
imply correctness. And using request size for idle detection (to let a cfqq
backlog the disk) isn't very good. An iodepth-1 4k workload could be idle, but
an iodepth-128 4k workload likely isn't idle (and such a workload can dispatch 128
requests in almost zero time on a high queue depth disk).

Thanks,
Shaohua

2016-10-04 18:50:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Vivek.

On Tue, Oct 04, 2016 at 02:12:45PM -0400, Vivek Goyal wrote:
> Agreed that we don't have a good basic unit to measure IO cost. I was
> thinking of measuring cost in terms of sectors as that's simple and
> gets more accurate on faster devices with almost no seek penalty. And

If this were true, we could simply base everything on bandwidth;
unfortunately, even highspeed ssds perform wildly differently
depending on the specifics of workloads.

> in fact this proposal is also providing fairness in terms of bandwitdh.
> One extra feature seems to be this notion of minimum bandwidth for each
> cgroup and until and unless all competing groups have met their minimum,
> other cgroups can't cross their limits.

Haven't read the patches yet but it should allow regulating in terms
of both bandwidth and iops.

> (BTW, should we call io.high, io.minimum instead. To say, this is the
> minimum bandwidth group should get before others get to cross their
> minimum limit till max limit).

The naming convention is min, low, high, max, but I'm not sure "min",
which means a hard minimum amount (whether guaranteed or best-effort),
quite makes sense here.

> > It mostly defers the burden to the one who's configuring the limits
> > and expects it to know the characteristics of the device and workloads
> > and configure accordingly. It's quite a bit more tedious to use but
> > should be able to cover good portion of use cases without being overly
> > complicated. I agree that it'd be nice to have a simple proportional
> > control but as you said can't see a good solution for it at the
> > moment.
>
> Ok, so idea is that if we can't provide something accurate in kernel,
> then expose a very low level knob, which is harder to configure but
> should work in some cases where users know their devices and workload
> very well.

Yeah, that's the basic idea for this approach. It'd be great if we
eventually end up with proper proportional control but having
something low level is useful anyway, so...

> > I don't think it's catering to specific use cases. It is a generic
> > mechanism which demands knowledge and experimentation to configure.
> > It's more a way for the kernel to cop out and defer figuring out
> > device characteristics to userland. If you have a better idea, I'm
> > all ears.
>
> I don't think I have a better idea as such. Once we had talked and you
> mentioned that for faster devices we should probably do some token based
> mechanism (which I believe would probably mean sector based IO
> accounting).

That's more about the implementation strategy and doesn't affect
whether we support bw, iops or combined configurations. In terms of
implementation, I still think it'd be great to have something token
based with per-cpu batch to lower the cpu overhead on highspeed
devices but that shouldn't really affect the semantics.
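
To make the per-cpu batching idea concrete, here is a toy sketch in Python
(illustrative only; none of these names, sizes or periods come from an actual
patch). A shared per-period budget stands in for an atomic counter, and each
CPU grabs tokens in coarse batches so that most IOs are charged purely from
CPU-local state:

import threading

class TokenPool:
    """Shared budget for one throttling period; stands in for an atomic counter."""
    def __init__(self, tokens_per_period, batch):
        self.budget = tokens_per_period
        self.batch = batch
        self.lock = threading.Lock()

    def grab_batch(self):
        # The only operation that touches shared state.
        with self.lock:
            take = min(self.batch, self.budget)
            self.budget -= take
            return take

class PerCpuCache:
    """Tokens cached by one CPU; most charges never reach the shared pool."""
    def __init__(self, pool):
        self.pool = pool
        self.local = 0

    def charge(self, cost):
        # Charge `cost` tokens (e.g., bytes) for an IO; False means "throttle
        # until the pool is refilled at the start of the next period".
        while self.local < cost:
            got = self.pool.grab_batch()
            if got == 0:
                return False
            self.local += got
        self.local -= cost
        return True

pool = TokenPool(tokens_per_period=100 << 20, batch=1 << 20)  # 100 MiB period, 1 MiB batches
cpu0 = PerCpuCache(pool)
print(cpu0.charge(4096))   # True: charged from the local cache after one batch grab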

Thanks.

--
tejun

2016-10-04 18:54:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Paolo.

On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
> > I don't think IO bandwidth does not matter. The problem is bandwidth can't
> > measure IO cost. For example, you can't say 8k IO costs 2x IO resource than 4k
> > IO.
>
> For what goal do you need to be able to say this, once you succeeded
> in guaranteeing bandwidth and low latency to each
> process/client/group/node/user?

For resource partitioning mostly. It's not a single user or purpose
use case. The same device gets shared across unrelated workloads and
we need to guarantee differing levels of quality of service to each
regardless of the specifics of workload. We actually need to be able
to control IO resources.

Thanks.

--
tejun

2016-10-04 18:56:23

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 20:50, Tejun Heo <[email protected]> ha scritto:
>
> Hello, Vivek.
>
> On Tue, Oct 04, 2016 at 02:12:45PM -0400, Vivek Goyal wrote:
>> Agreed that we don't have a good basic unit to measure IO cost. I was
>> thinking of measuring cost in terms of sectors as that's simple and
>> gets more accurate on faster devices with almost no seek penalty. And
>
> If this were true, we could simply base everything on bandwidth;
> unfortunately, even highspeed ssds perform wildly differently
> depending on the specifics of workloads.
>

If you base your throttler or scheduler on time, and bandwidth varies
with workload, as you correctly point out, then the result is loss of
control on bandwidth distribution, hence unfairness and
hard-to-control (high) latency. If you use BFQ's approach, as we
already discussed with numbers and examples, you have stable fairness
and low latency. More precisely, given your workload, you can even
compute formally the strong service guarantees you provide.

Thanks,
Paolo

>> in fact this proposal is also providing fairness in terms of bandwidth.
>> One extra feature seems to be this notion of minimum bandwidth for each
>> cgroup and until and unless all competing groups have met their minimum,
>> other cgroups can't cross their limits.
>
> Haven't read the patches yet but it should allow regulating in terms
> of both bandwidth and iops.
>
>> (BTW, should we call io.high, io.minimum instead. To say, this is the
>> minimum bandwidth group should get before others get to cross their
>> minimum limit till max limit).
>
> The naming convention is min, low, high, max but I'm not sure "min",
> which means hard minimum amount (whether guaranteed or best-effort),
> quite makes sense here.
>
>>> It mostly defers the burden to the one who's configuring the limits
>>> and expects it to know the characteristics of the device and workloads
>>> and configure accordingly. It's quite a bit more tedious to use but
>>> should be able to cover good portion of use cases without being overly
>>> complicated. I agree that it'd be nice to have a simple proportional
>>> control but as you said can't see a good solution for it at the
>>> moment.
>>
>> Ok, so idea is that if we can't provide something accurate in kernel,
>> then expose a very low level knob, which is harder to configure but
>> should work in some cases where users know their devices and workload
>> very well.
>
> Yeah, that's the basic idea for this approach. It'd be great if we
> eventually end up with proper proportional control but having
> something low level is useful anyway, so...
>
>>> I don't think it's catering to specific use cases. It is a generic
>>> mechanism which demands knowledge and experimentation to configure.
>>> It's more a way for the kernel to cop out and defer figuring out
>>> device characteristics to userland. If you have a better idea, I'm
>>> all ears.
>>
>> I don't think I have a better idea as such. Once we had talked and you
>> mentioned that for faster devices we should probably do some token based
>> mechanism (which I believe would probably mean sector based IO
>> accounting).
>
> That's more about the implementation strategy and doesn't affect
> whether we support bw, iops or combined configurations. In terms of
> implementation, I still think it'd be great to have something token
> based with per-cpu batch to lower the cpu overhead on highspeed
> devices but that shouldn't really affect the semantics.
>
> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-04 19:02:57

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 20:54, Tejun Heo <[email protected]> ha scritto:
>
> Hello, Paolo.
>
> On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
>>> I don't think IO bandwidth does not matter. The problem is bandwidth can't
>>> measure IO cost. For example, you can't say 8k IO costs 2x IO resource than 4k
>>> IO.
>>
>> For what goal do you need to be able to say this, once you succeeded
>> in guaranteeing bandwidth and low latency to each
>> process/client/group/node/user?
>
> For resource partitioning mostly. It's not a single user or purpose
> use case. The same device gets shared across unrelated workloads and
> we need to guarantee differing levels of quality of service to each
> regardless of the specifics of workload.

That's exactly what BFQ has succeeded in doing in all the tests
devised so far. Can you give me a concrete example that I can
try with BFQ and with any other mechanism you deem better? If
you are right, numbers will just make your point.

Thanks,
Paolo

> We actually need to be able
> to control IO resources.
>


> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-04 19:14:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Paolo.

On Tue, Oct 04, 2016 at 09:02:47PM +0200, Paolo Valente wrote:
> That's exactly what BFQ has succeeded in doing in all the tests
> devised so far. Can you give me a concrete example that I can
> try with BFQ and with any other mechanism you deem better? If
> you are right, numbers will just make your point.

Hmm... I think we already discussed this but here's a really simple
case. There are three unknown workloads A, B and C and we want to
give A certain best-effort guarantees (let's say around 80% of the
underlying device) whether A is sharing the device with B or C.

I get that bfq can be a good compromise on most desktop workloads and
behave reasonably well for some server workloads with the slice
expiration mechanism but it really isn't an IO resource partitioning
mechanism.

Thanks.

--
tejun

2016-10-04 19:30:02

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 21:14, Tejun Heo <[email protected]> ha scritto:
>
> Hello, Paolo.
>
> On Tue, Oct 04, 2016 at 09:02:47PM +0200, Paolo Valente wrote:
>> That's exactly what BFQ has succeeded in doing in all the tests
>> devised so far. Can you give me a concrete example that I can
>> try with BFQ and with any other mechanism you deem better? If
>> you are right, numbers will just make your point.
>
> Hmm... I think we already discussed this but here's a really simple
> case. There are three unknown workloads A, B and C and we want to
> give A certain best-effort guarantees (let's say around 80% of the
> underlying device) whether A is sharing the device with B or C.
>

That's the same example that you proposed me in our previous
discussion. For this example I showed you, with many boring numbers,
that with BFQ you get the most accurate distribution of the resource.

If you have enough stamina, I can repeat them again. To save your
patience, here is a very brief summary. In a concrete use case, the
unknown workloads turn into something like this: there will be a first
time interval during which A happens to be, say, sequential, B happens
to be, say, random and C happens to be, say, quasi-sequential. Then
there will be a next time interval during which their characteristics
change, and so on. It is easy (but boring, I acknowledge it) to show
that, for each of these time intervals BFQ provides the best possible
service in terms of fairness, bandwidth distribution, stability and so
on. Why? Because of the elastic bandwidth-time scheduling of BFQ
that we already discussed, and because BFQ is naturally accurate in
redistributing aggregate throughput proportionally, when needed.
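
As a purely numerical illustration of that redistribution (this is not BFQ's
algorithm, just the arithmetic of work-conserving weighted sharing; the
capacity and demands in MB/s are invented), the excess left by a group that
asks for less than its weighted share goes back to the busy groups in
proportion to their weights:

def proportional_shares(capacity, weights, demands):
    """Weighted max-min sharing: satisfied groups keep their demand,
    the leftover is split among the rest in proportion to weight."""
    shares = {}
    active = set(weights)
    while active:
        total_w = sum(weights[g] for g in active)
        quantum = capacity / total_w
        satisfied = [g for g in active if demands[g] <= quantum * weights[g]]
        if not satisfied:
            for g in active:
                shares[g] = quantum * weights[g]
            return shares
        for g in satisfied:
            shares[g] = demands[g]
            capacity -= demands[g]
            active.remove(g)
    return shares

weights = {"A": 2, "B": 1, "C": 1}
print(proportional_shares(400, weights, {"A": 400, "B": 400, "C": 400}))
# all greedy    -> A 200, B 100, C 100 (the 2:1:1 weights)
print(proportional_shares(400, weights, {"A": 50, "B": 400, "C": 400}))
# A nearly idle -> A 50, B 175, C 175 (A's excess split between B and C)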

> I get that bfq can be a good compromise on most desktop workloads and
> behave reasonably well for some server workloads with the slice
> expiration mechanism but it really isn't an IO resource partitioning
> mechanism.
>

Right. My argument is that BFQ enables you to give to each client the
bandwidth and low-latency guarantees you want. And this IMO is way
better than partitioning a resource and then getting unavoidable
unfairness and high latency.

Thanks,
Paolo

> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-04 19:49:42

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 20:28, Shaohua Li <[email protected]> ha scritto:
>
> On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
>>
>>> Il giorno 04 ott 2016, alle ore 19:28, Shaohua Li <[email protected]> ha scritto:
>>>
>>> On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
>>>>
>>>>> Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo <[email protected]> ha scritto:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
>>>>>> Could you please elaborate more on this point? BFQ uses sectors
>>>>>> served to measure service, and, on the all the fast devices on which
>>>>>> we have tested it, it accurately distributes
>>>>>> bandwidth as desired, redistributes excess bandwidth with any issue,
>>>>>> and guarantees high responsiveness and low latency at application and
>>>>>> system level (e.g., ~0 drop rate in video playback, with any background
>>>>>> workload tested).
>>>>>
>>>>> The same argument as before. Bandwidth is a very bad measure of IO
>>>>> resources spent. For specific use cases (like desktop or whatever),
>>>>> this can work but not generally.
>>>>>
>>>>
>>>> Actually, we have already discussed this point, and IMHO the arguments
>>>> that (apparently) convinced you that bandwidth is the most relevant
>>>> service guarantee for I/O in desktops and the like, prove that
>>>> bandwidth is the most important service guarantee in servers too.
>>>>
>>>> Again, all the examples I can think of seem to confirm it:
>>>> . file hosting: a good service must guarantee reasonable read/write,
>>>> i.e., download/upload, speeds to users
>>>> . file streaming: a good service must guarantee low drop rates, and
>>>> this can be guaranteed only by guaranteeing bandwidth and latency
>>>> . web hosting: high bandwidth and low latency needed here too
>>>> . clouds: high bw and low latency needed to let, e.g., users of VMs
>>>> enjoy high responsiveness and, for example, reasonable file-copy
>>>> time
>>>> ...
>>>>
>>>> To put in yet another way, with packet I/O in, e.g., clouds, there are
>>>> basically the same issues, and the main goal is again guaranteeing
>>>> bandwidth and low latency among nodes.
>>>>
>>>> Could you please provide a concrete server example (assuming we still
>>>> agree about desktops), where I/O bandwidth does not matter while time
>>>> does?
>>>
>>> I don't think IO bandwidth does not matter. The problem is bandwidth can't
>>> measure IO cost. For example, you can't say 8k IO costs 2x IO resource than 4k
>>> IO.
>>>
>>
>> For what goal do you need to be able to say this, once you succeeded
>> in guaranteeing bandwidth and low latency to each
>> process/client/group/node/user?
>
> I think we are discussing if bandwidth should be used to measure IO for
> propotional IO scheduling.


Yes. But my point is upstream. It's something like this:

Can bandwidth and low latency guarantees be provided with a
sector-based proportional-share scheduler?

YOUR ANSWER: No, then we need to look for other non-trivial solutions.
Hence your arguments in this discussion.

MY ANSWER: Yes, I have already achieved this goal for years now, with
a publicly available, proportional-share scheduler. A lot of test
results with many devices, papers discussing details, demos, and so on
are available too.

> Since bandwidth can't measure the cost and you are
> using it to do arbitration, you will either have low latency but unfair
> bandwidth, or fair bandwidth but some workloads have unexpected high latency.
> But it might be ok depending on the latency target (for example, you can set
> the latency target high, so low latency is guaranteed*) and workload
> characteristics. I think bandwidth-based proportional scheduling will only
> work for workloads where the disk isn't fully utilized.
>
>>>>>> Could you please suggest me some test to show how sector-based
>>>>>> guarantees fails?
>>>>>
>>>>> Well, mix 4k random and sequential workloads and try to distribute the
>>>>> actual IO resources.
>>>>>
>>>>
>>>>
>>>> If I'm not mistaken, we have already gone through this example too,
>>>> and I thought we agreed on what service scheme worked best, again
>>>> focusing only on desktops. To make a long story short(er), here is a
>>>> snippet from one of our last exchanges.
>>>>
>>>> ----------
>>>>
>>>> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
>>>>> Maybe the source of confusion is the fact that a simple sector-based,
>>>>> proportional share scheduler always distributes total bandwidth
>>>>> according to weights. The catch is the additional BFQ rule: random
>>>>> workloads get only time isolation, and are charged for full budgets,
>>>>> so as to not affect the schedule of quasi-sequential workloads. So,
>>>>> the correct claim for BFQ is that it distributes total bandwidth
>>>>> according to weights (only) when all competing workloads are
>>>>> quasi-sequential. If some workloads are random, then these workloads
>>>>> are just time scheduled. This does break proportional-share bandwidth
>>>>> distribution with mixed workloads, but, much more importantly, saves
>>>>> both total throughput and individual bandwidths of quasi-sequential
>>>>> workloads.
>>>>>
>>>>> We could then check whether I did succeed in tuning timeouts and
>>>>> budgets so as to achieve the best tradeoffs. But this is probably a
>>>>> second-order problem as of now.
>>>
>>> I don't see why random/sequential matters for SSD. What really matters is
>>> request size and IO depth. Time scheduling is dubious too, as workloads can
>>> dispatch all IO within almost 0 time in high queue depth disks.
>>>
>>
>> That's an orthogonal issue. If what matters is, e.g., size, then it is
>> enough to replace "sequential I/O" with "large-request I/O". In case
>> I have been too vague, here is an example: I mean that, e.g., in an I/O
>> scheduler you replace the function that computes whether a queue is
>> seeky based on request distance, with a function based on
>> request size. And this is exactly what has been already done, for
>> example, in CFQ:
>>
>> if (blk_queue_nonrot(cfqd->queue))
>> cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
>> else
>> cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
>
> CFQ is known to be unfair on SSDs, especially high queue depth SSDs, so those
> lines don't prove correctness.

I'm afraid CFQ is unfair for reasons that have little or nothing to do
with the above lines of code (which I pasted just to give you an
example, sorry for creating a misunderstanding).

> And using request size for idle detection (i.e., letting a cfqq
> keep the disk backlogged) isn't very good either. An iodepth 1 4k workload could
> be considered idle, but an iodepth 128 4k workload likely isn't (such a workload
> can dispatch all 128 requests in almost 0 time on a high queue depth disk).
>

That's absolutely true. And it is one of the most challenging issues
I have addressed in BFQ. So far the solutions I have found proved to
work well. But, as I said to Tejun, if you have a concrete example
for which you expect BFQ to fail, just tell me and I will try.
Maximum depth is 32 with blk devices (if I'm not missing something,
given my limited expertise), but that would probably be enough to
prove your point.

Let me add just a comment, to not be misunderstood. I'm not
undervaluing your proposal. I'm trying to point out that sector-based
proportional share works, and it is likely to be the best solution
exactly with devices with varying bandwidth and deep queues. Yet I do
think that your proposal is a good and accurately-designed solution,
definitely necessary until good schedulers are
available (of course I mean sector-based schedulers! ;) ).

Thanks,
Paolo

> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-04 20:27:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Paolo.

On Tue, Oct 04, 2016 at 09:29:48PM +0200, Paolo Valente wrote:
> > Hmm... I think we already discussed this but here's a really simple
> > case. There are three unknown workloads A, B and C and we want to
> > give A certain best-effort guarantees (let's say around 80% of the
> > underlying device) whether A is sharing the device with B or C.
>
> That's the same example that you proposed me in our previous
> discussion. For this example I showed you, with many boring numbers,
> that with BFQ you get the most accurate distribution of the resource.

Yes, it is about the same example and what I understood was that
"accurate distribution of the resources" holds as long as the
randomness is incidental (ie. due to layout on the filesystem and so
on) with the slice expiration mechanism offsetting the actually random
workloads.

> If you have enough stamina, I can repeat them again. To save your

I'll go back to the thread and re-read them.

> patience, here is a very brief summary. In a concrete use case, the
> unknown workloads turn into something like this: there will be a first
> time interval during which A happens to be, say, sequential, B happens
> to be, say, random and C happens to be, say, quasi-sequential. Then
> there will be a next time interval during which their characteristics
> change, and so on. It is easy (but boring, I acknowledge it) to show
> that, for each of these time intervals BFQ provides the best possible
> service in terms of fairness, bandwidth distribution, stability and so
> on. Why? Because of the elastic bandwidth-time scheduling of BFQ
> that we already discussed, and because BFQ is naturally accurate in
> redistributing aggregate throughput proportionally, when needed.

Yeah, that's what I remember and for workload above certain level of
randomness its time consumption is mapped to bw, right?

> > I get that bfq can be a good compromise on most desktop workloads and
> > behave reasonably well for some server workloads with the slice
> > expiration mechanism but it really isn't an IO resource partitioning
> > mechanism.
>
> Right. My argument is that BFQ enables you to give to each client the
> bandwidth and low-latency guarantees you want. And this IMO is way
> better than partitioning a resource and then getting unavoidable
> unfairness and high latency.

But that statement only holds while bw is the main thing to guarantee,
no? The level of isolation that we're looking for here is fairly
strict adherence to sub/few-milliseconds in terms of high percentile
scheduling latency while within the configured bw/iops limits, not
"overall this device is being used pretty well".

Thanks.

--
tejun

2016-10-05 12:37:15

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 04 ott 2016, alle ore 22:27, Tejun Heo <[email protected]> ha scritto:
>
> Hello, Paolo.
>
> On Tue, Oct 04, 2016 at 09:29:48PM +0200, Paolo Valente wrote:
>>> Hmm... I think we already discussed this but here's a really simple
>>> case. There are three unknown workloads A, B and C and we want to
>>> give A certain best-effort guarantees (let's say around 80% of the
>>> underlying device) whether A is sharing the device with B or C.
>>
>> That's the same example that you proposed me in our previous
>> discussion. For this example I showed you, with many boring numbers,
>> that with BFQ you get the most accurate distribution of the resource.
>
> Yes, it is about the same example and what I understood was that
> "accurate distribution of the resources" holds as long as the
> randomness is incidental (ie. due to layout on the filesystem and so
> on) with the slice expiration mechanism offsetting the actually random
> workloads.
>

For completeness, this property holds whatever the workload is,
and in particular even when the workload changes.

>> If you have enough stamina, I can repeat them again. To save your
>
> I'll go back to the thread and re-read them.
>

Maybe we can make this less boring, see the end of this email.

>> patience, here is a very brief summary. In a concrete use case, the
>> unknown workloads turn into something like this: there will be a first
>> time interval during which A happens to be, say, sequential, B happens
>> to be, say, random and C happens to be, say, quasi-sequential. Then
>> there will be a next time interval during which their characteristics
>> change, and so on. It is easy (but boring, I acknowledge it) to show
>> that, for each of these time intervals BFQ provides the best possible
>> service in terms of fairness, bandwidth distribution, stability and so
>> on. Why? Because of the elastic bandwidth-time scheduling of BFQ
>> that we already discussed, and because BFQ is naturally accurate in
>> redistributing aggregate throughput proportionally, when needed.
>
> Yeah, that's what I remember and for workload above certain level of
> randomness its time consumption is mapped to bw, right?
>

Exactly.

>>> I get that bfq can be a good compromise on most desktop workloads and
>>> behave reasonably well for some server workloads with the slice
>>> expiration mechanism but it really isn't an IO resource partitioning
>>> mechanism.
>>
>> Right. My argument is that BFQ enables you to give to each client the
>> bandwidth and low-latency guarantees you want. And this IMO is way
>> better than partitioning a resource and then getting unavoidable
>> unfairness and high latency.
>
> But that statement only holds while bw is the main thing to guarantee,
> no? The level of isolation that we're looking for here is fairly
> strict adherence to sub/few-milliseconds in terms of high percentile
> scheduling latency while within the configured bw/iops limits, not
> "overall this device is being used pretty well".
>

Guaranteeing such a short-term latency, while guaranteeing not just bw
limits, but also proportional share distribution of the bw, is the
reason why we devised BFQ years ago.

Anyway, to avoid going on with trying speculations and arguments, let
me retry with a practical proposal. BFQ is out there, free. Let's
just test, measure and check whether we have already a solution to
the problems you/we are still trying to solve in Linux.

In this respect, for your generic, unpredictable scenario to make
sense, there must exist at least one real system that meets the
requirements of such a scenario. Or, if such a real system does not
yet exist, it must be possible to emulate it. If it is impossible to
achieve this last goal either, then I miss the usefulness
of looking for solutions for such a scenario.

That said, let's define the instance(s) of the scenario that you find
most representative, and let's test BFQ on it/them. Numbers will give
us the answers. For example, what about all or part of the following
groups:
. one cyclically doing random I/O for some seconds and then sequential I/O
for the next seconds
. one doing, say, quasi-sequential I/O in ON/OFF cycles
. one starting an application cyclically
. one playing back or streaming a movie

For each group, we could then measure the time needed to complete each
phase of I/O in each cycle, plus the responsiveness in the group
starting an application, plus the frame drop in the group streaming
the movie. In addition, we can measure the bandwidth/iops enjoyed by
each group, plus, of course, the aggregate throughput of the whole
system. In particular we could compare results with throttling, BFQ,
and CFQ.

Then we could write resulting numbers on the stone, and stick to them
until something proves them wrong.

What do you (or others) think about it?

Thanks,
Paolo

> Thanks.
>
> --
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-05 13:12:21

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:

[..]
> Anyway, to avoid going on with trying speculations and arguments, let
> me retry with a practical proposal. BFQ is out there, free. Let's
> just test, measure and check whether we have already a solution to
> the problems you/we are still trying to solve in Linux.

Hi Paolo,

Does the BFQ implementation scale for fast storage devices using the blk-mq
interface? We will want to make sure that locking and other overhead of
BFQ is very minimal so that overall throughput does not suffer.

Vivek

2016-10-05 14:05:11

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 05 ott 2016, alle ore 15:12, Vivek Goyal <[email protected]> ha scritto:
>
> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>
> [..]
>> Anyway, to avoid going on with trying speculations and arguments, let
>> me retry with a practical proposal. BFQ is out there, free. Let's
>> just test, measure and check whether we have already a solution to
>> the problems you/we are still trying to solve in Linux.
>
> Hi Paolo,
>
> Does the BFQ implementation scale for fast storage devices using the blk-mq
> interface? We will want to make sure that locking and other overhead of
> BFQ is very minimal so that overall throughput does not suffer.
>

Of course BFQ needs to be modified to work in blk-mq. I'm rather sure
its overhead will then be small enough, just because I have already
collaborated on a basically equivalent port from single to multi-queue for
packet scheduling (with Luigi Rizzo and others), and our prototype can
make over 15 million scheduling decisions per second, and keep latency
low, even with tens of concurrent clients running on a multi-core,
multi-socket system.

For details, here is the paper [1], plus some slides [2].

Actually, the solution in [1] is a global scheduler, which is more
complex than the first blk-mq version of BFQ that I have in mind,
namely, partitioned scheduling, in which there should be one
independent scheduler instance per core. But this is still investigation
territory. BTW, I would really appreciate help/feedback on this task [3].

Thanks,
Paolo

[1] http://info.iet.unipi.it/~luigi/papers/20160921-pspat.pdf
[2] http://info.iet.unipi.it/~luigi/pspat/
[3] https://marc.info/?l=linux-kernel&m=147066540916339&w=2

> Vivek
>


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-05 14:49:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

Hello, Paolo.

On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> In this respect, for your generic, unpredictable scenario to make
> sense, there must exist at least one real system that meets the
> requirements of such a scenario. Or, if such a real system does not
> yet exist, it must be possible to emulate it. If it is impossible to
> achieve this last goal either, then I miss the usefulness
> of looking for solutions for such a scenario.
>
> That said, let's define the instance(s) of the scenario that you find
> most representative, and let's test BFQ on it/them. Numbers will give
> us the answers. For example, what about all or part of the following
> groups:
> . one cyclically doing random I/O for some seconds and then sequential I/O
> for the next seconds
> . one doing, say, quasi-sequential I/O in ON/OFF cycles
> . one starting an application cyclically
> . one playing back or streaming a movie
>
> For each group, we could then measure the time needed to complete each
> phase of I/O in each cycle, plus the responsiveness in the group
> starting an application, plus the frame drop in the group streaming
> the movie. In addition, we can measure the bandwidth/iops enjoyed by
> each group, plus, of course, the aggregate throughput of the whole
> system. In particular we could compare results with throttling, BFQ,
> and CFQ.
>
> Then we could write resulting numbers on the stone, and stick to them
> until something proves them wrong.
>
> What do you (or others) think about it?

That sounds great and yeah it's lame that we didn't start with that.
Shaohua, would it be difficult to compare how bfq performs against
blk-throttle?

Thanks.

--
tejun

2016-10-05 18:31:22

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> Hello, Paolo.
>
> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> > In this respect, for your generic, unpredictable scenario to make
> > sense, there must exist at least one real system that meets the
> > requirements of such a scenario. Or, if such a real system does not
> > yet exist, it must be possible to emulate it. If it is impossible to
> > achieve this last goal either, then I miss the usefulness
> > of looking for solutions for such a scenario.
> >
> > That said, let's define the instance(s) of the scenario that you find
> > most representative, and let's test BFQ on it/them. Numbers will give
> > us the answers. For example, what about all or part of the following
> > groups:
> > . one cyclically doing random I/O for some seconds and then sequential I/O
> > for the next seconds
> > . one doing, say, quasi-sequential I/O in ON/OFF cycles
> > . one starting an application cyclically
> > . one playing back or streaming a movie
> >
> > For each group, we could then measure the time needed to complete each
> > phase of I/O in each cycle, plus the responsiveness in the group
> > starting an application, plus the frame drop in the group streaming
> > the movie. In addition, we can measure the bandwidth/iops enjoyed by
> > each group, plus, of course, the aggregate throughput of the whole
> > system. In particular we could compare results with throttling, BFQ,
> > and CFQ.
> >
> > Then we could write resulting numbers on the stone, and stick to them
> > until something proves them wrong.
> >
> > What do you (or others) think about it?
>
> That sounds great and yeah it's lame that we didn't start with that.
> Shaohua, would it be difficult to compare how bfq performs against
> blk-throttle?

I had a test of BFQ. I'm using BFQ found at
http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
4.7.0-v8r3. It's a LSI SSD, queue depth 32. I use default setting. fio script
is:

[global]
ioengine=libaio
direct=1
readwrite=randread
bs=4k
runtime=60
time_based=1
file_service_type=random:36
overwrite=1
thread=0
group_reporting=1
filename=/dev/sdb
iodepth=1
numjobs=8

[groupA]
prio=2

[groupB]
new_group
prio=6

I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.

iodepth=1 numjobs=1 prio 4:4
CFQ: 28:28 BFQ: 21:21 deadline: 29:29

iodepth=8 numjobs=1 prio 4:4
CFQ: 162:162 BFQ: 102:98 deadline: 205:205

iodepth=1 numjobs=8 prio 4:4
CFQ: 157:157 BFQ: 81:92 deadline: 196:197

iodepth=1 numjobs=1 prio 2:6
CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29

iodepth=8 numjobs=1 prio 2:6
CFQ: 166:174 BFQ: 139:72 deadline: 202:202

iodepth=1 numjobs=8 prio 2:6
CFQ: 148:150 BFQ: 90:77 deadline: 198:197

CFQ isn't fair at all. BFQ is very good on this front, but has poor throughput
even when prio is the default value.

Thanks,
Shaohua

2016-10-05 19:09:22

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Wed, Oct 05, 2016 at 11:30:53AM -0700, Shaohua Li wrote:
> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> > Hello, Paolo.
> >
> > On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> > > In this respect, for your generic, unpredictable scenario to make
> > > sense, there must exist at least one real system that meets the
> > > requirements of such a scenario. Or, if such a real system does not
> > > yet exist, it must be possible to emulate it. If it is impossible to
> > > achieve this last goal either, then I miss the usefulness
> > > of looking for solutions for such a scenario.
> > >
> > > That said, let's define the instance(s) of the scenario that you find
> > > most representative, and let's test BFQ on it/them. Numbers will give
> > > us the answers. For example, what about all or part of the following
> > > groups:
> > > . one cyclically doing random I/O for some seconds and then sequential I/O
> > > for the next seconds
> > > . one doing, say, quasi-sequential I/O in ON/OFF cycles
> > > . one starting an application cyclically
> > > . one playing back or streaming a movie
> > >
> > > For each group, we could then measure the time needed to complete each
> > > phase of I/O in each cycle, plus the responsiveness in the group
> > > starting an application, plus the frame drop in the group streaming
> > > the movie. In addition, we can measure the bandwidth/iops enjoyed by
> > > each group, plus, of course, the aggregate throughput of the whole
> > > system. In particular we could compare results with throttling, BFQ,
> > > and CFQ.
> > >
> > > Then we could write resulting numbers on the stone, and stick to them
> > > until something proves them wrong.
> > >
> > > What do you (or others) think about it?
> >
> > That sounds great and yeah it's lame that we didn't start with that.
> > Shaohua, would it be difficult to compare how bfq performs against
> > blk-throttle?
>
> I had a test of BFQ. I'm using BFQ found at
> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
> 4.7.0-v8r3. It's a LSI SSD, queue depth 32. I use default setting. fio script
> is:
>
> [global]
> ioengine=libaio
> direct=1
> readwrite=randread
> bs=4k
> runtime=60
> time_based=1
> file_service_type=random:36
> overwrite=1
> thread=0
> group_reporting=1
> filename=/dev/sdb
> iodepth=1
> numjobs=8
>
> [groupA]
> prio=2
>
> [groupB]
> new_group
> prio=6
>
> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>
> iodepth=1 numjobs=1 prio 4:4
> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>
> iodepth=8 numjobs=1 prio 4:4
> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>
> iodepth=1 numjobs=8 prio 4:4
> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>
> iodepth=1 numjobs=1 prio 2:6
> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>
> iodepth=8 numjobs=1 prio 2:6
> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>
> iodepth=1 numjobs=8 prio 2:6
> CFQ: 148:150 BFQ: 90:77 deadline: 198:197

More tests:

iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
CFQ:51:207 BFQ: 51:45 deadline: 51:216

iodepth=1 numjobs=1 prio 2:6, group A bs=4k, group B bs=64k
CFQ:25:249 BFQ: 23:42 deadline: 26:251

Thanks,
Shaohua

2016-10-05 19:47:37

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li <[email protected]> ha scritto:
>
> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>> Hello, Paolo.
>>
>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>> In this respect, for your generic, unpredictable scenario to make
>>> sense, there must exist at least one real system that meets the
>>> requirements of such a scenario. Or, if such a real system does not
>>> yet exist, it must be possible to emulate it. If it is impossible to
>>> achieve this last goal either, then I miss the usefulness
>>> of looking for solutions for such a scenario.
>>>
>>> That said, let's define the instance(s) of the scenario that you find
>>> most representative, and let's test BFQ on it/them. Numbers will give
>>> us the answers. For example, what about all or part of the following
>>> groups:
>>> . one cyclically doing random I/O for some seconds and then sequential I/O
>>> for the next seconds
>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>> . one starting an application cyclically
>>> . one playing back or streaming a movie
>>>
>>> For each group, we could then measure the time needed to complete each
>>> phase of I/O in each cycle, plus the responsiveness in the group
>>> starting an application, plus the frame drop in the group streaming
>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
>>> each group, plus, of course, the aggregate throughput of the whole
>>> system. In particular we could compare results with throttling, BFQ,
>>> and CFQ.
>>>
>>> Then we could write resulting numbers on the stone, and stick to them
>>> until something proves them wrong.
>>>
>>> What do you (or others) think about it?
>>
>> That sounds great and yeah it's lame that we didn't start with that.
>> Shaohua, would it be difficult to compare how bfq performs against
>> blk-throttle?
>
> I had a test of BFQ.

Thank you very much for testing BFQ!

> I'm using BFQ found at
> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
> 4.7.0-v8r3.

That's the latest stable version. The development version [1] already
contains further improvements for fairness, latency and throughput.
It is however still a release candidate.

[1] https://github.com/linusw/linux-bfq/tree/bfq-v8

> It's a LSI SSD, queue depth 32. I use default setting. fio script
> is:
>
> [global]
> ioengine=libaio
> direct=1
> readwrite=randread
> bs=4k
> runtime=60
> time_based=1
> file_service_type=random:36
> overwrite=1
> thread=0
> group_reporting=1
> filename=/dev/sdb
> iodepth=1
> numjobs=8
>
> [groupA]
> prio=2
>
> [groupB]
> new_group
> prio=6
>
> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>
> iodepth=1 numjobs=1 prio 4:4
> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>
> iodepth=8 numjobs=1 prio 4:4
> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>
> iodepth=1 numjobs=8 prio 4:4
> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>
> iodepth=1 numjobs=1 prio 2:6
> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>
> iodepth=8 numjobs=1 prio 2:6
> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>
> iodepth=1 numjobs=8 prio 2:6
> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>
> CFQ isn't fair at all. BFQ is very good on this front, but has poor throughput
> even when prio is the default value.
>

Throughput is lower with BFQ for two reasons.

First, you certainly left the low_latency in its default state, i.e.,
on. As explained, e.g., here [2], low_latency mode is totally geared
towards maximum responsiveness and minimum latency for soft real-time
applications (e.g., video players). To achieve this goal, BFQ is
willing to perform more idling, when necessary. This lowers
throughput (I'll get back on this at the end of the discussion of the
second reason).

The second, most important reason, is that a minimum of idling is the
*only* way to achieve differentiated bandwidth distribution, as you
requested by setting different ioprios. I stress that this constraint
is not a technological accident, but an intrinsic, logical necessity.
The proof is simple, and if the following explanation is too boring or
confusing, I can show it to you with any trace of sync I/O.

First, to provide differentiated service, you need per-process
scheduling, i.e., schedulers in which there is a separate queue
associated with each process. Now, let A be the process with higher
weight (ioprio), and B the process with lower weight. Both processes
are sync, thus, by definition, they issue requests as follows: a few
requests (probably two, or a little bit more with larger iodepth),
then a little break to wait for request completion, then the next
small batch and so on. For each process, the queue associated with
the process (in the scheduler) is necessarily empty on the break. As
a consequence, if there is no idling, then every time A reaches its
break, the scheduler has only the option to switch to B (which is
extremely likely to have pending requests).

The service pattern of the processes then unavoidably becomes:

A B A B A B ...

where each letter represents a full small batch served for the
process. That is, 50% of the bw for each process, and complete loss
of control on the desired bandwidth distribution.

So, to sum up, the reason why BFQ achieves a lower total bw is that it
behaves in the only correct way to respect weights with sync I/O,
i.e., it performs a little idling. If low_latency is on, then BFQ
increases idling further, and this may have caused further bw loss
in your test (but this varies greatly with devices, so you can
discover it only by trying).

The bottom line is that if you do want to achieve differentiation with
sync I/O, you have to pay a price in terms of bw, because of idling.
Actually, the recent preemption mechanism that I have introduced in
BFQ is proving so effective in preserving differentiation, that I'm
tempted to try a solution with almost no idling. A little accuracy
should however be sacrificed. Anyway, this is still work in progress.
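
To make the A B A B argument above concrete, here is a toy discrete-time model
(illustrative only, not BFQ code; the service time, think time, batch size,
slice size and the 3:1 weights are all invented). Without idling, the 3:1
weights collapse to a 50/50 split at full device utilization; with an idle
window as long as the think time, the split follows the weights at the cost of
some utilization:

SERVICE = 10           # time units to complete one request
THINK = 4              # pause before a sync process submits its next batch
BATCH = 2              # requests per submitted batch
WEIGHT = {"A": 3, "B": 1}
SLICE = 8              # budget, in requests, granted per unit of weight
TOTAL = 4000           # requests to serve per run

def run(idle_wait):
    served = {"A": 0, "B": 0}
    queued = {"A": BATCH, "B": BATCH}
    ready = {"A": 0, "B": 0}            # when the next batch becomes available
    now, current, budget = 0, None, 0

    while served["A"] + served["B"] < TOTAL:
        for p in "AB":                  # sync processes resubmit after thinking
            if queued[p] == 0 and ready[p] <= now:
                queued[p] = BATCH
        if current is not None and budget > 0 and queued[current] == 0:
            if ready[current] - now <= idle_wait:
                now = ready[current]    # idle briefly so the slice continues
                queued[current] = BATCH
            else:
                current = None          # no idling: the slice is given up
        if current is None or budget == 0:
            candidates = [p for p in "AB" if queued[p] > 0]
            if not candidates:
                now = min(ready.values())
                continue
            current = max(candidates, key=WEIGHT.get)
            budget = SLICE * WEIGHT[current]
        queued[current] -= 1            # serve one request from the current slice
        served[current] += 1
        budget -= 1
        now += SERVICE
        if queued[current] == 0:        # batch done: the process goes off to think
            ready[current] = now + THINK
    return served, now

for label, idle in (("no idling  ", 0), ("idle window", THINK)):
    served, elapsed = run(idle)
    total = served["A"] + served["B"]
    print(f"{label}: A {100 * served['A'] // total}% / B {100 * served['B'] // total}%"
          f" of requests, {SERVICE * total / elapsed:.0%} device utilization")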

Thank you very much,
Paolo

[2] http://algogroup.unimore.it/people/paolo/disk_sched/description.php

> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-05 19:57:31

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 05 ott 2016, alle ore 21:08, Shaohua Li <[email protected]> ha scritto:
>
> On Wed, Oct 05, 2016 at 11:30:53AM -0700, Shaohua Li wrote:
>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>> Hello, Paolo.
>>>
>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>>> In this respect, for your generic, unpredictable scenario to make
>>>> sense, there must exist at least one real system that meets the
>>>> requirements of such a scenario. Or, if such a real system does not
>>>> yet exist, it must be possible to emulate it. If it is impossible to
>>>> achieve this last goal either, then I miss the usefulness
>>>> of looking for solutions for such a scenario.
>>>>
>>>> That said, let's define the instance(s) of the scenario that you find
>>>> most representative, and let's test BFQ on it/them. Numbers will give
>>>> us the answers. For example, what about all or part of the following
>>>> groups:
>>>> . one cyclically doing random I/O for some seconds and then sequential I/O
>>>> for the next seconds
>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>>> . one starting an application cyclically
>>>> . one playing back or streaming a movie
>>>>
>>>> For each group, we could then measure the time needed to complete each
>>>> phase of I/O in each cycle, plus the responsiveness in the group
>>>> starting an application, plus the frame drop in the group streaming
>>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
>>>> each group, plus, of course, the aggregate throughput of the whole
>>>> system. In particular we could compare results with throttling, BFQ,
>>>> and CFQ.
>>>>
>>>> Then we could write resulting numbers on the stone, and stick to them
>>>> until something proves them wrong.
>>>>
>>>> What do you (or others) think about it?
>>>
>>> That sounds great and yeah it's lame that we didn't start with that.
>>> Shaohua, would it be difficult to compare how bfq performs against
>>> blk-throttle?
>>
>> I had a test of BFQ. I'm using BFQ found at
>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
>> 4.7.0-v8r3. It's a LSI SSD, queue depth 32. I use default setting. fio script
>> is:
>>
>> [global]
>> ioengine=libaio
>> direct=1
>> readwrite=randread
>> bs=4k
>> runtime=60
>> time_based=1
>> file_service_type=random:36
>> overwrite=1
>> thread=0
>> group_reporting=1
>> filename=/dev/sdb
>> iodepth=1
>> numjobs=8
>>
>> [groupA]
>> prio=2
>>
>> [groupB]
>> new_group
>> prio=6
>>
>> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>>
>> iodepth=1 numjobs=1 prio 4:4
>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>>
>> iodepth=8 numjobs=1 prio 4:4
>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>>
>> iodepth=1 numjobs=8 prio 4:4
>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>>
>> iodepth=1 numjobs=1 prio 2:6
>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>>
>> iodepth=8 numjobs=1 prio 2:6
>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>>
>> iodepth=1 numjobs=8 prio 2:6
>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>
> More tests:
>
> iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
> CFQ:51:207 BFQ: 51:45 deadline: 51:216
>
> iodepth=1 numjobs=1 prio 2:6, group A bs=4k, group B bs=64k
> CFQ:25:249 BFQ: 23:42 deadline: 26:251
>

A true proportional share scheduler like BFQ works under the
assumption that it is the only limiter of the bandwidth of its clients.
And the availability of such a scheduler should apparently make
bandwidth limiting useless: once you have a mechanism that allows you
to give each group the desired fraction of the bandwidth, and to
redistribute excess bandwidth seamlessly when needed, what do you need
additional limiting for?

But I'm no expert on every possible system configuration or
requirement. So, if you have practical examples, I would really
appreciate them. And I don't think it will be difficult to see what
goes wrong in BFQ with external bw limitation, and to fix the
problem.

Thanks,
Paolo

> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-05 20:07:36

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 05 ott 2016, alle ore 21:47, Paolo Valente <[email protected]> ha scritto:
>
>>
>> Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li <[email protected]> ha scritto:
>>
>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>> Hello, Paolo.
>>>
>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>>> In this respect, for your generic, unpredictable scenario to make
>>>> sense, there must exist at least one real system that meets the
>>>> requirements of such a scenario. Or, if such a real system does not
>>>> yet exist, it must be possible to emulate it. If it is impossible to
>>>> achieve this last goal either, then I miss the usefulness
>>>> of looking for solutions for such a scenario.
>>>>
>>>> That said, let's define the instance(s) of the scenario that you find
>>>> most representative, and let's test BFQ on it/them. Numbers will give
>>>> us the answers. For example, what about all or part of the following
>>>> groups:
>>>> . one cyclically doing random I/O for some seconds and then sequential I/O
>>>> for the next seconds
>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>>> . one starting an application cyclically
>>>> . one playing back or streaming a movie
>>>>
>>>> For each group, we could then measure the time needed to complete each
>>>> phase of I/O in each cycle, plus the responsiveness in the group
>>>> starting an application, plus the frame drop in the group streaming
>>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
>>>> each group, plus, of course, the aggregate throughput of the whole
>>>> system. In particular we could compare results with throttling, BFQ,
>>>> and CFQ.
>>>>
>>>> Then we could write resulting numbers on the stone, and stick to them
>>>> until something proves them wrong.
>>>>
>>>> What do you (or others) think about it?
>>>
>>> That sounds great and yeah it's lame that we didn't start with that.
>>> Shaohua, would it be difficult to compare how bfq performs against
>>> blk-throttle?
>>
>> I had a test of BFQ.
>
> Thank you very much for testing BFQ!
>
>> I'm using BFQ found at
>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. version is
>> 4.7.0-v8r3.
>
> That's the latest stable version. The development version [1] already
> contains further improvements for fairness, latency and throughput.
> It is however still a release candidate.
>
> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>
>> It's a LSI SSD, queue depth 32. I use default setting. fio script
>> is:
>>
>> [global]
>> ioengine=libaio
>> direct=1
>> readwrite=randread
>> bs=4k
>> runtime=60
>> time_based=1
>> file_service_type=random:36
>> overwrite=1
>> thread=0
>> group_reporting=1
>> filename=/dev/sdb
>> iodepth=1
>> numjobs=8
>>
>> [groupA]
>> prio=2
>>
>> [groupB]
>> new_group
>> prio=6
>>
>> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>>
>> iodepth=1 numjobs=1 prio 4:4
>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>>
>> iodepth=8 numjobs=1 prio 4:4
>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>>
>> iodepth=1 numjobs=8 prio 4:4
>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>>
>> iodepth=1 numjobs=1 prio 2:6
>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>>
>> iodepth=8 numjobs=1 prio 2:6
>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>>
>> iodepth=1 numjobs=8 prio 2:6
>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>>
>> CFQ isn't fair at all. BFQ is very good on this front, but has poor throughput
>> even when prio is the default value.
>>
>
> Throughput is lower with BFQ for two reasons.
>
> First, you certainly left the low_latency in its default state, i.e.,
> on. As explained, e.g., here [2], low_latency mode is totally geared
> towards maximum responsiveness and minimum latency for soft real-time
> applications (e.g., video players). To achieve this goal, BFQ is
> willing to perform more idling, when necessary. This lowers
> throughput (I'll get back on this at the end of the discussion of the
> second reason).
>
> The second, most important reason, is that a minimum of idling is the
> *only* way to achieve differentiated bandwidth distribution, as you
> requested by setting different ioprios. I stress that this constraint
> is not a technological accident, but an intrinsic, logical necessity.
> The proof is simple, and if the following explanation is too boring or
> confusing, I can show it to you with any trace of sync I/O.
>
> First, to provide differentiated service, you need per-process
> scheduling, i.e., schedulers in which there is a separate queue
> associated with each process. Now, let A be the process with higher
> weight (ioprio), and B the process with lower weight. Both processes
> are sync, thus, by definition, they issue requests as follows: a few
> requests (probably two, or a little bit more with larger iodepth),
> then a little break to wait for request completion, then the next
> small batch and so on. For each process, the queue associated with
> the process (in the scheduler) is necessarily empty on the break. As
> a consequence, if there is no idling, then every time A reaches its
> break, the scheduler has only the option to switch to B (which is
> extremely likely to have pending requests).
>
> The service pattern of the processes then unavoidably becomes:
>
> A B A B A B ...
>
> where each letter represents a full small batch served for the
> process. That is, 50% of the bw for each process, and complete loss
> of control on the desired bandwidth distribution.
>
> So, to sum up, the reason why BFQ achieves a lower total bw is that it
> behaves in the only correct way to respect weights with sync I/O,
> i.e., it performs a little idling. If low_latency is on, then BFQ
> increases idling further, and this may have caused further bw loss
> in your test (but this varies greatly with devices, so you can
> discover it only by trying).
>
> The bottom line is that if you do want to achieve differentiation with
> sync I/O, you have to pay a price in terms of bw, because of idling.
> Actually, the recent preemption mechanism that I have introduced in
> BFQ is proving so effective in preserving differentiation, that I'm
> tempted to try a solution with almost no idling. A little accuracy
> should however be sacrificed. Anyway, this is still work in progress.
>

Just for completeness, if the weights of the processes or groups are
equal, then BFQ preserves service guarantees while achieving the same
bw as deadline. And equal weights is the most common case according
to my limited experience.

Thanks,
Paolo

> Thank you very much,
> Paolo
>
> [2] http://algogroup.unimore.it/people/paolo/disk_sched/description.php
>
>> Thanks,
>> Shaohua
>
>
> --
> Paolo Valente
> Algogroup
> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
> Via Campi 213/B
> 41125 Modena - Italy
> http://algogroup.unimore.it/people/paolo/
>
>
>
>
>


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-05 20:36:55

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Wed, Oct 05, 2016 at 09:57:22PM +0200, Paolo Valente wrote:
>
> > Il giorno 05 ott 2016, alle ore 21:08, Shaohua Li <[email protected]> ha scritto:
> >
> > On Wed, Oct 05, 2016 at 11:30:53AM -0700, Shaohua Li wrote:
> >> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> >>> Hello, Paolo.
> >>>
> >>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> >>>> In this respect, for your generic, unpredictable scenario to make
> >>>> sense, there must exist at least one real system that meets the
> >>>> requirements of such a scenario. Or, if such a real system does not
> >>>> yet exist, it must be possible to emulate it. If it is impossible to
> >>>> achieve this last goal either, then I miss the usefulness
> >>>> of looking for solutions for such a scenario.
> >>>>
> >>>> That said, let's define the instance(s) of the scenario that you find
> >>>> most representative, and let's test BFQ on it/them. Numbers will give
> >>>> us the answers. For example, what about all or part of the following
> >>>> groups:
> >>>> . one cyclically doing random I/O for some second and then sequential I/O
> >>>> for the next seconds
> >>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
> >>>> . one starting an application cyclically
> >>>> . one playing back or streaming a movie
> >>>>
> >>>> For each group, we could then measure the time needed to complete each
> >>>> phase of I/O in each cycle, plus the responsiveness in the group
> >>>> starting an application, plus the frame drop in the group streaming
> >>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
> >>>> each group, plus, of course, the aggregate throughput of the whole
> >>>> system. In particular we could compare results with throttling, BFQ,
> >>>> and CFQ.
> >>>>
> >>>> Then we could write resulting numbers on the stone, and stick to them
> >>>> until something proves them wrong.
> >>>>
> >>>> What do you (or others) think about it?
> >>>
> >>> That sounds great and yeah it's lame that we didn't start with that.
> >>> Shaohua, would it be difficult to compare how bfq performs against
> >>> blk-throttle?
> >>
> >> I had a test of BFQ. I'm using BFQ found at
> >> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php . version is
> >> 4.7.0-v8r3. It's a LSI SSD, queue depth 32. I use default setting. fio script
> >> is:
> >>
> >> [global]
> >> ioengine=libaio
> >> direct=1
> >> readwrite=randread
> >> bs=4k
> >> runtime=60
> >> time_based=1
> >> file_service_type=random:36
> >> overwrite=1
> >> thread=0
> >> group_reporting=1
> >> filename=/dev/sdb
> >> iodepth=1
> >> numjobs=8
> >>
> >> [groupA]
> >> prio=2
> >>
> >> [groupB]
> >> new_group
> >> prio=6
> >>
> >> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
> >>
> >> iodepth=1 numjobs=1 prio 4:4
> >> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
> >>
> >> iodepth=8 numjobs=1 prio 4:4
> >> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
> >>
> >> iodepth=1 numjobs=8 prio 4:4
> >> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
> >>
> >> iodepth=1 numjobs=1 prio 2:6
> >> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
> >>
> >> iodepth=8 numjobs=1 prio 2:6
> >> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
> >>
> >> iodepth=1 numjobs=8 prio 2:6
> >> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
> >
> > More tests:
> >
> > iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
> > CFQ:51:207 BFQ: 51:45 deadline: 51:216
> >
> > iodepth=1 numjobs=1 prio 2:6, group A bs=4k, group B bs=64k
> > CFQ:25:249 BFQ: 23:42 deadline: 26:251
> >
>
> A true proportional share scheduler like BFQ works under the
> assumption to be the only limiter of the bandwidth of its clients.
> And the availability of such a scheduler should apparently make
> bandwidth limiting useless: once you have a mechanism that allows you
> to give each group the desired fraction of the bandwidth, and to
> redistribute excess bandwidth seamlessly when needed, what do you need
> additional limiting for?
>
> But I'm not expert of any possible system configuration or
> requirement. So, if you have practical examples, I would really
> appreciate them. And I don't think it will be difficult to see what
> goes wrong in BFQ with external bw limitation, and to fix the
> problem.

I think the test emulates a very common configuration. We assign more IO
resources to the high-priority workload, but such a workload doesn't always
dispatch enough IO; that's why I set a rate limit. When this happens, we want
the low-priority workload to use the spare disk bandwidth. That's the whole
point of disk sharing.
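
For reference, here is a minimal sketch of how such a cap can be expressed
through the cgroup2 io.max interface (8:16 is sdb, and 52428800 bytes/s is
the 50MB/s used in the test above; whether the limit in the test was set
this way or inside fio isn't stated, so treat the paths and values as purely
illustrative):

  # assuming cgroup2 is mounted at /sys/fs/cgroup
  echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/high
  echo "8:16 rbps=52428800" > /sys/fs/cgroup/high/io.max

The io.high file added by this series would presumably take the same
device/value syntax; that's an assumption on my part.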

Thanks,
Shaohua

2016-10-05 20:46:28

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
>
> > Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li <[email protected]> ha scritto:
> >
> > On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> >> Hello, Paolo.
> >>
> >> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> >>> In this respect, for your generic, unpredictable scenario to make
> >>> sense, there must exist at least one real system that meets the
> >>> requirements of such a scenario. Or, if such a real system does not
> >>> yet exist, it must be possible to emulate it. If it is impossible to
> >>> achieve this last goal either, then I miss the usefulness
> >>> of looking for solutions for such a scenario.
> >>>
> >>> That said, let's define the instance(s) of the scenario that you find
> >>> most representative, and let's test BFQ on it/them. Numbers will give
> >>> us the answers. For example, what about all or part of the following
> >>> groups:
> >>> . one cyclically doing random I/O for some second and then sequential I/O
> >>> for the next seconds
> >>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
> >>> . one starting an application cyclically
> >>> . one playing back or streaming a movie
> >>>
> >>> For each group, we could then measure the time needed to complete each
> >>> phase of I/O in each cycle, plus the responsiveness in the group
> >>> starting an application, plus the frame drop in the group streaming
> >>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
> >>> each group, plus, of course, the aggregate throughput of the whole
> >>> system. In particular we could compare results with throttling, BFQ,
> >>> and CFQ.
> >>>
> >>> Then we could write resulting numbers on the stone, and stick to them
> >>> until something proves them wrong.
> >>>
> >>> What do you (or others) think about it?
> >>
> >> That sounds great and yeah it's lame that we didn't start with that.
> >> Shaohua, would it be difficult to compare how bfq performs against
> >> blk-throttle?
> >
> > I had a test of BFQ.
>
> Thank you very much for testing BFQ!
>
> > I'm using BFQ found at
> > http://algogroup.unimore.it/people/paolo/disk_sched/sources.php . version is
> > 4.7.0-v8r3.
>
> That's the latest stable version. The development version [1] already
> contains further improvements for fairness, latency and throughput.
> It is however still a release candidate.
>
> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>
> > It's a LSI SSD, queue depth 32. I use default setting. fio script
> > is:
> >
> > [global]
> > ioengine=libaio
> > direct=1
> > readwrite=randread
> > bs=4k
> > runtime=60
> > time_based=1
> > file_service_type=random:36
> > overwrite=1
> > thread=0
> > group_reporting=1
> > filename=/dev/sdb
> > iodepth=1
> > numjobs=8
> >
> > [groupA]
> > prio=2
> >
> > [groupB]
> > new_group
> > prio=6
> >
> > I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
> >
> > iodepth=1 numjobs=1 prio 4:4
> > CFQ: 28:28 BFQ: 21:21 deadline: 29:29
> >
> > iodepth=8 numjobs=1 prio 4:4
> > CFQ: 162:162 BFQ: 102:98 deadline: 205:205
> >
> > iodepth=1 numjobs=8 prio 4:4
> > CFQ: 157:157 BFQ: 81:92 deadline: 196:197
> >
> > iodepth=1 numjobs=1 prio 2:6
> > CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
> >
> > iodepth=8 numjobs=1 prio 2:6
> > CFQ: 166:174 BFQ: 139:72 deadline: 202:202
> >
> > iodepth=1 numjobs=8 prio 2:6
> > CFQ: 148:150 BFQ: 90:77 deadline: 198:197
> >
> > CFQ isn't fair at all. BFQ is very good in this side, but has poor throughput
> > even prio is the default value.
> >
>
> Throughput is lower with BFQ for two reasons.
>
> First, you certainly left the low_latency in its default state, i.e.,
> on. As explained, e.g., here [2], low_latency mode is totally geared
> towards maximum responsiveness and minimum latency for soft real-time
> applications (e.g., video players). To achieve this goal, BFQ is
> willing to perform more idling, when necessary. This lowers
> throughput (I'll get back on this at the end of the discussion of the
> second reason).

Changing low_latency to 0 doesn't seem to change anything, at least for this
test: iodepth=1 numjobs=1 prio 2:6, group A bs=4k, group B bs=64k
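
(Presumably via the usual sysfs tunable, i.e. something like

  echo 0 > /sys/block/sdb/queue/iosched/low_latency

with BFQ selected on sdb.)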

> The second, most important reason, is that a minimum of idling is the
> *only* way to achieve differentiated bandwidth distribution, as you
> requested by setting different ioprios. I stress that this constraint
> is not a technological accident, but a intrinsic, logical necessity.
> The proof is simple, and if the following explanation is too boring or
> confusing, I can show it to you with any trace of sync I/O.
>
> First, to provide differentiated service, you need per-process
> scheduling, i.e., schedulers in which there is a separate queue
> associated with each process. Now, let A be the process with higher
> weight (ioprio), and B the process with lower weight. Both processes
> are sync, thus, by definition, they issue requests as follows: a few
> requests (probably two, or a little bit more with larger iodepth),
> then a little break to wait for request completion, then the next
> small batch and so on. For each process, the queue associated with
> the process (in the scheduler) is necessarily empty on the break. As
> a consequence, if there is no idling, then every time A reaches its
> break, the scheduler has only the option to switch to B (which is
> extremely likely to have pending requests).
>
> The service pattern of the processes then unavoidably becomes:
>
> A B A B A B ...
>
> where each letter represents a full small batch served for the
> process. That is, 50% of the bw for each process, and complete loss
> of control on the desired bandwidth distribution.
>
> So, to sum up, the reason why BFQ achieves a lower total bw is that it
> behaves in the only correct way to respect weights with sync I/O,
> i.e., it performs a little idling. If low_latency is on, then BFQ
> increases idling further, and this may be have caused further bw loss
> in your test (but this varies greatly with devices, so you can
> discover it only by trying).
>
> The bottom line is that if you do want to achieve differentiation with
> sync I/O, you have to pay a price in terms of bw, because of idling.
> Actually, the recent preemption mechanism that I have introduced in
> BFQ is proving so effective in preserving differentiation, that I'm
> tempted to try some almost idleness solution. A little of accuracy
> should however be sacrificed. Anyway, this is still work in progress.

Yep, I fully understand why idling is required here. As long as the workload's
IO depth is lower than the queue's IO depth, idling is the only way to maintain
fairness. This is the core of CFQ, and I bet the same holds for BFQ.
Unfortunately, idling the disk harms throughput too much, especially for
high-end SSDs.
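
To make that concrete with a small, purely illustrative example: give A and
B hypothetical weights 3 and 1, so the desired split is 75%/25%. With strict
alternation (A B A B ...) and equal batch sizes, each process gets one batch
per round, i.e. 50%/50%, regardless of the weights. Only by idling briefly
after A's batch, so that A's next batch can be served before switching to B,
can the scheduler push the split back toward 75%/25%; that idle time is
exactly where the throughput goes.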

Thanks,
Shaohua

2016-10-06 07:22:20

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 05 ott 2016, alle ore 22:36, Shaohua Li <[email protected]> ha scritto:
>
> On Wed, Oct 05, 2016 at 09:57:22PM +0200, Paolo Valente wrote:
>>
>>> Il giorno 05 ott 2016, alle ore 21:08, Shaohua Li <[email protected]> ha scritto:
>>>
>>> On Wed, Oct 05, 2016 at 11:30:53AM -0700, Shaohua Li wrote:
>>>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>>>> Hello, Paolo.
>>>>>
>>>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>>>>> In this respect, for your generic, unpredictable scenario to make
>>>>>> sense, there must exist at least one real system that meets the
>>>>>> requirements of such a scenario. Or, if such a real system does not
>>>>>> yet exist, it must be possible to emulate it. If it is impossible to
>>>>>> achieve this last goal either, then I miss the usefulness
>>>>>> of looking for solutions for such a scenario.
>>>>>>
>>>>>> That said, let's define the instance(s) of the scenario that you find
>>>>>> most representative, and let's test BFQ on it/them. Numbers will give
>>>>>> us the answers. For example, what about all or part of the following
>>>>>> groups:
>>>>>> . one cyclically doing random I/O for some second and then sequential I/O
>>>>>> for the next seconds
>>>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>>>>> . one starting an application cyclically
>>>>>> . one playing back or streaming a movie
>>>>>>
>>>>>> For each group, we could then measure the time needed to complete each
>>>>>> phase of I/O in each cycle, plus the responsiveness in the group
>>>>>> starting an application, plus the frame drop in the group streaming
>>>>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
>>>>>> each group, plus, of course, the aggregate throughput of the whole
>>>>>> system. In particular we could compare results with throttling, BFQ,
>>>>>> and CFQ.
>>>>>>
>>>>>> Then we could write resulting numbers on the stone, and stick to them
>>>>>> until something proves them wrong.
>>>>>>
>>>>>> What do you (or others) think about it?
>>>>>
>>>>> That sounds great and yeah it's lame that we didn't start with that.
>>>>> Shaohua, would it be difficult to compare how bfq performs against
>>>>> blk-throttle?
>>>>
>>>> I had a test of BFQ. I'm using BFQ found at
> >>>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php . version is
>>>> 4.7.0-v8r3. It's a LSI SSD, queue depth 32. I use default setting. fio script
>>>> is:
>>>>
>>>> [global]
>>>> ioengine=libaio
>>>> direct=1
>>>> readwrite=randread
>>>> bs=4k
>>>> runtime=60
>>>> time_based=1
>>>> file_service_type=random:36
>>>> overwrite=1
>>>> thread=0
>>>> group_reporting=1
>>>> filename=/dev/sdb
>>>> iodepth=1
>>>> numjobs=8
>>>>
>>>> [groupA]
>>>> prio=2
>>>>
>>>> [groupB]
>>>> new_group
>>>> prio=6
>>>>
>>>> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>>>>
>>>> iodepth=1 numjobs=1 prio 4:4
>>>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>>>>
>>>> iodepth=8 numjobs=1 prio 4:4
>>>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>>>>
>>>> iodepth=1 numjobs=8 prio 4:4
>>>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>>>>
>>>> iodepth=1 numjobs=1 prio 2:6
>>>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>>>>
>>>> iodepth=8 numjobs=1 prio 2:6
>>>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>>>>
>>>> iodepth=1 numjobs=8 prio 2:6
>>>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>>>
>>> More tests:
>>>
>>> iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
>>> CFQ:51:207 BFQ: 51:45 deadline: 51:216
>>>
>>> iodepth=1 numjobs=1 prio 2:6, group A bs=4k, group B bs=64k
>>> CFQ:25:249 BFQ: 23:42 deadline: 26:251
>>>
>>
>> A true proportional share scheduler like BFQ works under the
>> assumption to be the only limiter of the bandwidth of its clients.
>> And the availability of such a scheduler should apparently make
>> bandwidth limiting useless: once you have a mechanism that allows you
>> to give each group the desired fraction of the bandwidth, and to
>> redistribute excess bandwidth seamlessly when needed, what do you need
>> additional limiting for?
>>
>> But I'm not expert of any possible system configuration or
>> requirement. So, if you have practical examples, I would really
>> appreciate them. And I don't think it will be difficult to see what
>> goes wrong in BFQ with external bw limitation, and to fix the
>> problem.
>
> I think the test emulates a very common configuration. We assign more IO
> resources to high priority workload. But such workload doesn't always dispatch
> enough io. That's why I set a rate limit. When this happend, we hope low
> priority workload uses the disk bandwidth. That's the whole point of disk
> sharing.
>

But that's exactly the configuration for which a proportional-share
scheduler is designed: it systematically and seamlessly redistributes
excess bw, with no configuration needed. Or is there something else
in the scenario you have in mind?
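
As an illustration, with a proportional-share scheduler the whole
configuration reduces to assigning relative weights, and whatever bandwidth
one group does not use flows to the others automatically. A sketch using the
legacy cgroup v1 blkio controller (the group names and the 600/200 values
are arbitrary; blkio.weight is the CFQ weight file, or use the corresponding
weight attribute of the scheduler actually in use):

  echo 600 > /sys/fs/cgroup/blkio/high/blkio.weight
  echo 200 > /sys/fs/cgroup/blkio/low/blkio.weight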

Thanks,
Paolo

> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-06 07:59:02

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 05 ott 2016, alle ore 22:46, Shaohua Li <[email protected]> ha scritto:
>
> On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
>>
>>> Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li <[email protected]> ha scritto:
>>>
>>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>>> Hello, Paolo.
>>>>
>>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>>>> In this respect, for your generic, unpredictable scenario to make
>>>>> sense, there must exist at least one real system that meets the
>>>>> requirements of such a scenario. Or, if such a real system does not
>>>>> yet exist, it must be possible to emulate it. If it is impossible to
>>>>> achieve this last goal either, then I miss the usefulness
>>>>> of looking for solutions for such a scenario.
>>>>>
>>>>> That said, let's define the instance(s) of the scenario that you find
>>>>> most representative, and let's test BFQ on it/them. Numbers will give
>>>>> us the answers. For example, what about all or part of the following
>>>>> groups:
>>>>> . one cyclically doing random I/O for some second and then sequential I/O
>>>>> for the next seconds
>>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>>>> . one starting an application cyclically
>>>>> . one playing back or streaming a movie
>>>>>
>>>>> For each group, we could then measure the time needed to complete each
>>>>> phase of I/O in each cycle, plus the responsiveness in the group
>>>>> starting an application, plus the frame drop in the group streaming
>>>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
>>>>> each group, plus, of course, the aggregate throughput of the whole
>>>>> system. In particular we could compare results with throttling, BFQ,
>>>>> and CFQ.
>>>>>
>>>>> Then we could write resulting numbers on the stone, and stick to them
>>>>> until something proves them wrong.
>>>>>
>>>>> What do you (or others) think about it?
>>>>
>>>> That sounds great and yeah it's lame that we didn't start with that.
>>>> Shaohua, would it be difficult to compare how bfq performs against
>>>> blk-throttle?
>>>
>>> I had a test of BFQ.
>>
>> Thank you very much for testing BFQ!
>>
>>> I'm using BFQ found at
> >>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php . version is
>>> 4.7.0-v8r3.
>>
>> That's the latest stable version. The development version [1] already
>> contains further improvements for fairness, latency and throughput.
>> It is however still a release candidate.
>>
>> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>>
>>> It's a LSI SSD, queue depth 32. I use default setting. fio script
>>> is:
>>>
>>> [global]
>>> ioengine=libaio
>>> direct=1
>>> readwrite=randread
>>> bs=4k
>>> runtime=60
>>> time_based=1
>>> file_service_type=random:36
>>> overwrite=1
>>> thread=0
>>> group_reporting=1
>>> filename=/dev/sdb
>>> iodepth=1
>>> numjobs=8
>>>
>>> [groupA]
>>> prio=2
>>>
>>> [groupB]
>>> new_group
>>> prio=6
>>>
>>> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>>>
>>> iodepth=1 numjobs=1 prio 4:4
>>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>>>
>>> iodepth=8 numjobs=1 prio 4:4
>>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>>>
>>> iodepth=1 numjobs=8 prio 4:4
>>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>>>
>>> iodepth=1 numjobs=1 prio 2:6
>>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>>>
>>> iodepth=8 numjobs=1 prio 2:6
>>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>>>
>>> iodepth=1 numjobs=8 prio 2:6
>>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>>>
>>> CFQ isn't fair at all. BFQ is very good in this side, but has poor throughput
>>> even prio is the default value.
>>>
>>
>> Throughput is lower with BFQ for two reasons.
>>
>> First, you certainly left the low_latency in its default state, i.e.,
>> on. As explained, e.g., here [2], low_latency mode is totally geared
>> towards maximum responsiveness and minimum latency for soft real-time
>> applications (e.g., video players). To achieve this goal, BFQ is
>> willing to perform more idling, when necessary. This lowers
>> throughput (I'll get back on this at the end of the discussion of the
>> second reason).
>
> changing low_latency to 0 seems not change anything, at least for the test:
> iodepth=1 numjobs=1 prio 2:6 A bs 4k:64k
>
>> The second, most important reason, is that a minimum of idling is the
>> *only* way to achieve differentiated bandwidth distribution, as you
>> requested by setting different ioprios. I stress that this constraint
>> is not a technological accident, but a intrinsic, logical necessity.
>> The proof is simple, and if the following explanation is too boring or
>> confusing, I can show it to you with any trace of sync I/O.
>>
>> First, to provide differentiated service, you need per-process
>> scheduling, i.e., schedulers in which there is a separate queue
>> associated with each process. Now, let A be the process with higher
>> weight (ioprio), and B the process with lower weight. Both processes
>> are sync, thus, by definition, they issue requests as follows: a few
>> requests (probably two, or a little bit more with larger iodepth),
>> then a little break to wait for request completion, then the next
>> small batch and so on. For each process, the queue associated with
>> the process (in the scheduler) is necessarily empty on the break. As
>> a consequence, if there is no idling, then every time A reaches its
>> break, the scheduler has only the option to switch to B (which is
>> extremely likely to have pending requests).
>>
>> The service pattern of the processes then unavoidably becomes:
>>
>> A B A B A B ...
>>
>> where each letter represents a full small batch served for the
>> process. That is, 50% of the bw for each process, and complete loss
>> of control on the desired bandwidth distribution.
>>
>> So, to sum up, the reason why BFQ achieves a lower total bw is that it
>> behaves in the only correct way to respect weights with sync I/O,
>> i.e., it performs a little idling. If low_latency is on, then BFQ
>> increases idling further, and this may be have caused further bw loss
>> in your test (but this varies greatly with devices, so you can
>> discover it only by trying).
>>
>> The bottom line is that if you do want to achieve differentiation with
>> sync I/O, you have to pay a price in terms of bw, because of idling.
>> Actually, the recent preemption mechanism that I have introduced in
>> BFQ is proving so effective in preserving differentiation, that I'm
>> tempted to try some almost idleness solution. A little of accuracy
>> should however be sacrificed. Anyway, this is still work in progress.
>
> Yep, I fully understand why idle is required here. As long as workload io depth
> is lower than queue io depth, idle is the only way to maintain fairness. This
> is the core of CFQ, I bet the same for BFQ. Unfortunately idle disk harms
> throughput too much especially for high end SSD.
>

Then I'm afraid I have to give you very bad news: bw limiting causes
the same throughput loss. You can see it from your very own tests. Here
is one of your results with BFQ (one that is likely to have been
affected less by the fact that you left low_latency on, or by further
issues that I may not yet have addressed thoroughly):

iodepth=8 numjobs=1 prio 2:6
CFQ: 166:174 BFQ: 139:72 deadline: 202:202

Here is, instead, your test with bw limitation:
iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
CFQ:51:207 BFQ: 51:45 deadline: 51:216

From the first test, you see that the total bw achievable by the
device is at least 404MB/s. But in the second test you get at most
267MB/s, with deadline. For comparison, the total bw achieved by BFQ
in the first test is 211MB/s.
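
Just to spell out the arithmetic behind these figures (all numbers are
taken from the results quoted above):

  no limit:      deadline 202+202 = 404 MB/s,  BFQ 139+72 = 211 MB/s
  50MB/s limit:  deadline  51+216 = 267 MB/s

  extra bw lost with BFQ vs. throttling: 267 - 211 = 56 MB/s,
  i.e. 56/404, roughly the 13% mentioned below.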

So both throttling and proportional share have to waste bw; BFQ just
loses about 13% more of the total bw. In return, it gives you
incomparably better bw and latency guarantees, while allowing you to
configure your system with zero or minimal effort. In contrast, using
bw limits to properly configure a common system like, e.g., a large
file server may become a nightmare for a sysadmin. For example, if,
in the simplest case, she/he configures limits for the worst case,
then per-client limits will have to be extremely low. But the system
is large and dynamic, so the actual number of clients and the actual
bw consumed by each client will vary without a break, even in the
short-to-medium term. The bw-redistribution heuristics do not give
any provable guarantees on the accuracy of bw redistribution. The
result will likely be highly varying client bandwidths, with unlucky
clients unjustly confined to low limits and then experiencing high
latencies. The latter will be further emphasized by the intrinsically
bursty nature of throttling.

In addition, the scenario in your tests is the worst case for a
proportional-share solution: in a generic system, such as the file
server above, part of the workload is likely to be sequential or
quasi-sequential (at least over medium-length time intervals), and
this is enough to get very close to peak bw with a proportional-share
scheduler. No configuration needed. With bw throttling, you must do
the math very well to get peak bw all the time in a dynamic system.

To sum up, I do think that bw throttling is the best possible solution
(and in particular your solution is very good) as long as blk-mq lacks
an accurate scheduler. But where you do have such a scheduler, it makes
very little sense to give users very bad service for fear of wasting at
most 13% of the bw in a single worst case.

Thanks,
Paolo


> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-06 08:04:48

by Linus Walleij

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <[email protected]> wrote:

> I get that bfq can be a good compromise on most desktop workloads and
> behave reasonably well for some server workloads with the slice
> expiration mechanism but it really isn't an IO resource partitioning
> mechanism.

Not just desktops, also Android phones.

So why not have BFQ as a separate scheduling policy upstream,
alongside CFQ, deadline and noop?

I understand the CPU scheduler people's position that they want
one scheduler for everyone's everyday loads (except RT and
SCHED_DEADLINE) and I guess that is the source of the highlander
"there can be only one" argument, but note this:

kernel/Kconfig.preempt:

config PREEMPT_NONE
        bool "No Forced Preemption (Server)"
config PREEMPT_VOLUNTARY
        bool "Voluntary Kernel Preemption (Desktop)"
config PREEMPT
        bool "Preemptible Kernel (Low-Latency Desktop)"

We're already doing the per-usecase Kconfig thing for preemption.
But maybe somebody already hates that and wants to get rid of it,
I don't know.
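
Purely as an illustration (the BFQ entry below is hypothetical, but it
mirrors how the existing schedulers are already exposed in
block/Kconfig.iosched), a separate BFQ policy would just be one more entry
of the same kind:

config IOSCHED_BFQ
        tristate "BFQ I/O scheduler"
        ---help---
          Proportional-share I/O scheduler with per-process queues,
          aimed at strong fairness and low-latency guarantees.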

Yours,
Linus Walleij

2016-10-06 11:04:08

by Mark Brown

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <[email protected]> wrote:

> > I get that bfq can be a good compromise on most desktop workloads and
> > behave reasonably well for some server workloads with the slice
> > expiration mechanism but it really isn't an IO resource partitioning
> > mechanism.

> Not just desktops, also Android phones.

> So why not have BFQ as a separate scheduling policy upstream,
> alongside CFQ, deadline and noop?

Right.

> We're already doing the per-usecase Kconfig thing for preemption.
> But maybe somebody already hates that and want to get rid of it,
> I don't know.

Hannes also suggested earlier going back to making BFQ a separate
scheduler rather than replacing CFQ, pointing out that this mitigates
the risks of changing CFQ substantially at this point (which seems to
be the biggest issue here).



2016-10-06 11:57:58

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On 2016-10-06 07:03, Mark Brown wrote:
> On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
>> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <[email protected]> wrote:
>
>>> I get that bfq can be a good compromise on most desktop workloads and
>>> behave reasonably well for some server workloads with the slice
>>> expiration mechanism but it really isn't an IO resource partitioning
>>> mechanism.
>
>> Not just desktops, also Android phones.
>
>> So why not have BFQ as a separate scheduling policy upstream,
>> alongside CFQ, deadline and noop?
>
> Right.
>
>> We're already doing the per-usecase Kconfig thing for preemption.
>> But maybe somebody already hates that and want to get rid of it,
>> I don't know.
>
> Hannes also suggested going back to making BFQ a separate scheduler
> rather than replacing CFQ earlier, pointing out that it mitigates
> against the risks of changing CFQ substantially at this point (which
> seems to be the biggest issue here).
>
ISTR that the original argument for this approach essentially amounted
to: 'If it's so much better, why do we need both?'.

Such an argument is valid only if the new design is better in all
respects (there isn't sufficient information to decide that in this
case), or the negative aspects are worth the improvements (which is
too workload-specific to decide for something like this).

2016-10-06 12:50:32

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn <[email protected]> ha scritto:
>
> On 2016-10-06 07:03, Mark Brown wrote:
>> On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
>>> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <[email protected]> wrote:
>>
>>>> I get that bfq can be a good compromise on most desktop workloads and
>>>> behave reasonably well for some server workloads with the slice
>>>> expiration mechanism but it really isn't an IO resource partitioning
>>>> mechanism.
>>
>>> Not just desktops, also Android phones.
>>
>>> So why not have BFQ as a separate scheduling policy upstream,
>>> alongside CFQ, deadline and noop?
>>
>> Right.
>>
>>> We're already doing the per-usecase Kconfig thing for preemption.
>>> But maybe somebody already hates that and want to get rid of it,
>>> I don't know.
>>
>> Hannes also suggested going back to making BFQ a separate scheduler
>> rather than replacing CFQ earlier, pointing out that it mitigates
>> against the risks of changing CFQ substantially at this point (which
>> seems to be the biggest issue here).
>>
> ISTR that the original argument for this approach essentially amounted to: 'If it's so much better, why do we need both?'.
>
> Such an argument is valid only if the new design is better in all respects (which there isn't sufficient information to decide in this case), or the negative aspects are worth the improvements (which is too workload specific to decide for something like this).

All correct, apart from the workload-specific issue, which is not very clear to me. Over the last five years I have not found a single workload for which CFQ is better than BFQ, and none has been suggested.

Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: "we want BFQ to replace CFQ, but since CFQ is legacy code, you cannot change, and thus cannot replace, CFQ".

Thanks,
Paolo

--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-06 13:16:08

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 06 ott 2016, alle ore 09:58, Paolo Valente <[email protected]> ha scritto:
>
>>
>> Il giorno 05 ott 2016, alle ore 22:46, Shaohua Li <[email protected]> ha scritto:
>>
>> On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
>>>
>>>> Il giorno 05 ott 2016, alle ore 20:30, Shaohua Li <[email protected]> ha scritto:
>>>>
>>>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>>>> Hello, Paolo.
>>>>>
>>>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>>>>> In this respect, for your generic, unpredictable scenario to make
>>>>>> sense, there must exist at least one real system that meets the
>>>>>> requirements of such a scenario. Or, if such a real system does not
>>>>>> yet exist, it must be possible to emulate it. If it is impossible to
>>>>>> achieve this last goal either, then I miss the usefulness
>>>>>> of looking for solutions for such a scenario.
>>>>>>
>>>>>> That said, let's define the instance(s) of the scenario that you find
>>>>>> most representative, and let's test BFQ on it/them. Numbers will give
>>>>>> us the answers. For example, what about all or part of the following
>>>>>> groups:
>>>>>> . one cyclically doing random I/O for some second and then sequential I/O
>>>>>> for the next seconds
>>>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>>>>> . one starting an application cyclically
>>>>>> . one playing back or streaming a movie
>>>>>>
>>>>>> For each group, we could then measure the time needed to complete each
>>>>>> phase of I/O in each cycle, plus the responsiveness in the group
>>>>>> starting an application, plus the frame drop in the group streaming
>>>>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
>>>>>> each group, plus, of course, the aggregate throughput of the whole
>>>>>> system. In particular we could compare results with throttling, BFQ,
>>>>>> and CFQ.
>>>>>>
>>>>>> Then we could write resulting numbers on the stone, and stick to them
>>>>>> until something proves them wrong.
>>>>>>
>>>>>> What do you (or others) think about it?
>>>>>
>>>>> That sounds great and yeah it's lame that we didn't start with that.
>>>>> Shaohua, would it be difficult to compare how bfq performs against
>>>>> blk-throttle?
>>>>
>>>> I had a test of BFQ.
>>>
>>> Thank you very much for testing BFQ!
>>>
>>>> I'm using BFQ found at
>>>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php . version is
>>>> 4.7.0-v8r3.
>>>
>>> That's the latest stable version. The development version [1] already
>>> contains further improvements for fairness, latency and throughput.
>>> It is however still a release candidate.
>>>
>>> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>>>
>>>> It's a LSI SSD, queue depth 32. I use default setting. fio script
>>>> is:
>>>>
>>>> [global]
>>>> ioengine=libaio
>>>> direct=1
>>>> readwrite=randread
>>>> bs=4k
>>>> runtime=60
>>>> time_based=1
>>>> file_service_type=random:36
>>>> overwrite=1
>>>> thread=0
>>>> group_reporting=1
>>>> filename=/dev/sdb
>>>> iodepth=1
>>>> numjobs=8
>>>>
>>>> [groupA]
>>>> prio=2
>>>>
>>>> [groupB]
>>>> new_group
>>>> prio=6
>>>>
>>>> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>>>>
>>>> iodepth=1 numjobs=1 prio 4:4
>>>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>>>>
>>>> iodepth=8 numjobs=1 prio 4:4
>>>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>>>>
>>>> iodepth=1 numjobs=8 prio 4:4
>>>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>>>>
>>>> iodepth=1 numjobs=1 prio 2:6
>>>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>>>>
>>>> iodepth=8 numjobs=1 prio 2:6
>>>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>>>>
>>>> iodepth=1 numjobs=8 prio 2:6
>>>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>>>>
>>>> CFQ isn't fair at all. BFQ is very good in this side, but has poor throughput
>>>> even prio is the default value.
>>>>
>>>
>>> Throughput is lower with BFQ for two reasons.
>>>
>>> First, you certainly left the low_latency in its default state, i.e.,
>>> on. As explained, e.g., here [2], low_latency mode is totally geared
>>> towards maximum responsiveness and minimum latency for soft real-time
>>> applications (e.g., video players). To achieve this goal, BFQ is
>>> willing to perform more idling, when necessary. This lowers
>>> throughput (I'll get back on this at the end of the discussion of the
>>> second reason).
>>
>> changing low_latency to 0 seems not change anything, at least for the test:
>> iodepth=1 numjobs=1 prio 2:6 A bs 4k:64k
>>
>>> The second, most important reason, is that a minimum of idling is the
>>> *only* way to achieve differentiated bandwidth distribution, as you
>>> requested by setting different ioprios. I stress that this constraint
>>> is not a technological accident, but a intrinsic, logical necessity.
>>> The proof is simple, and if the following explanation is too boring or
>>> confusing, I can show it to you with any trace of sync I/O.
>>>
>>> First, to provide differentiated service, you need per-process
>>> scheduling, i.e., schedulers in which there is a separate queue
>>> associated with each process. Now, let A be the process with higher
>>> weight (ioprio), and B the process with lower weight. Both processes
>>> are sync, thus, by definition, they issue requests as follows: a few
>>> requests (probably two, or a little bit more with larger iodepth),
>>> then a little break to wait for request completion, then the next
>>> small batch and so on. For each process, the queue associated with
>>> the process (in the scheduler) is necessarily empty on the break. As
>>> a consequence, if there is no idling, then every time A reaches its
>>> break, the scheduler has only the option to switch to B (which is
>>> extremely likely to have pending requests).
>>>
>>> The service pattern of the processes then unavoidably becomes:
>>>
>>> A B A B A B ...
>>>
>>> where each letter represents a full small batch served for the
>>> process. That is, 50% of the bw for each process, and complete loss
>>> of control on the desired bandwidth distribution.
>>>
>>> So, to sum up, the reason why BFQ achieves a lower total bw is that it
>>> behaves in the only correct way to respect weights with sync I/O,
>>> i.e., it performs a little idling. If low_latency is on, then BFQ
>>> increases idling further, and this may be have caused further bw loss
>>> in your test (but this varies greatly with devices, so you can
>>> discover it only by trying).
>>>
>>> The bottom line is that if you do want to achieve differentiation with
>>> sync I/O, you have to pay a price in terms of bw, because of idling.
>>> Actually, the recent preemption mechanism that I have introduced in
>>> BFQ is proving so effective in preserving differentiation, that I'm
>>> tempted to try some almost idleness solution. A little of accuracy
>>> should however be sacrificed. Anyway, this is still work in progress.
>>
>> Yep, I fully understand why idle is required here. As long as workload io depth
>> is lower than queue io depth, idle is the only way to maintain fairness. This
>> is the core of CFQ, I bet the same for BFQ. Unfortunately idle disk harms
>> throughput too much especially for high end SSD.
>>
>
> Then I'm afraid I have to give you very bad news: bw limiting causes
> the same throughput loss. You can see it from your very tests. Here
> is one of your results with BFQ (one that is likely to have been
> affected less by the fact that you left low_latency on, or by further
> issues that I may have not yet addressed thoroughly):
>
> iodepth=8 numjobs=1 prio 2:6
> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>
> Here is, instead, your test with bw limitation:
> iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
> CFQ:51:207 BFQ: 51:45 deadline: 51:216
>
> From the first test, you see that the total bw achievable by the
> device is at least 404MB/s. But in the second test you get at most
> 267MB/s, with deadline. In this respect, the total bw achieved by BFQ
> in the first test is 211MB/s.
>
> So, both throttling and proportional share need to waste bw, BFQ
> looses about 13% more of the total bw. In return, it gives you
> incomparably better bw and latency guarantees, while allowing you to
> configure your system with zero or minimal effort. In contrast, using
> bw limits to properly configure a common system like, e.g., a large
> file server, may become a nightmare for a sysadmin. For example, if,
> in the simplest case, she/he configures limits for the worst-case,
> then per-client limits will have to be extremely low. But the system
> is large and dynamic, so the actual number of clients and at the
> actual bw consumed by each client will vary without a break, even in
> the short-medium term. The bw redistribution heuristic do not give
> any provable guarantees on accuracy of bw redistribution. The result
> will likely be highly varying client bandwidths, with unlucky clients
> unjustly limited to low limits, and then experiencing high latencies.
> The latter will be further emphasized by the intrinsically bursty
> nature of throttling.
>
> In addition, the scenario in your tests is the worst case for a
> proportional share solution: in a generic system, such as the the file
> server above, part of the workload is likely to be sequential or
> quasi-sequential (at least in medium-length time intervals), and this
> is enough to get very close to peak bw with a proportional-share
> scheduler. No configuration needed. With bw throttling, you must do
> the math very well to get peak bw all the time in a dynamic system.
>
> To sum up, I do think that bw throttling is the best possible solution
> (and in particular your solution is very good) till blk-mq lacks an
> accurate scheduler. But, where you do have such a scheduler, it makes
> very little sense to give users a very bad service, for fear of
> wasting at most 13% of the bw in a single worst case.
>

Shaohua, I have just realized that I have unconsciously been defending
a wrong argument. Although all the facts that I have reported are
evidently true, I have argued as if the question were: "do we need to
throw away throttling because there is proportional share, or do we
need to throw away proportional share because there is throttling?".
This question is simply wrong, as I do realize when I think about it
consciously (sorry for my dissociated behavior :) ).

The best goal to achieve is to have both a good throttling mechanism
and a good proportional-share scheduler. This goal would be valid even
if there were just one important scenario for each of the two
approaches. The sore point here is that you guys are constantly, and
rightly, working on solutions to achieve and consolidate reasonable
QoS guarantees, while an apparently very good proportional-share
scheduler has been kept out of mainline for years. If you (or others)
have good arguments to support this state of affairs, then this would
probably be an important point to discuss.

Thanks,
Paolo

> Thanks,
> Paolo
>
>
>> Thanks,
>> Shaohua
>
>
> --
> Paolo Valente
> Algogroup
> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
> Via Campi 213/B
> 41125 Modena - Italy
> http://algogroup.unimore.it/people/paolo/
>
>
>
>
>


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-06 13:53:34

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On 2016-10-06 08:50, Paolo Valente wrote:
>
>> Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn <[email protected]> ha scritto:
>>
>> On 2016-10-06 07:03, Mark Brown wrote:
>>> On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
>>>> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <[email protected]> wrote:
>>>
>>>>> I get that bfq can be a good compromise on most desktop workloads and
>>>>> behave reasonably well for some server workloads with the slice
>>>>> expiration mechanism but it really isn't an IO resource partitioning
>>>>> mechanism.
>>>
>>>> Not just desktops, also Android phones.
>>>
>>>> So why not have BFQ as a separate scheduling policy upstream,
>>>> alongside CFQ, deadline and noop?
>>>
>>> Right.
>>>
>>>> We're already doing the per-usecase Kconfig thing for preemption.
>>>> But maybe somebody already hates that and want to get rid of it,
>>>> I don't know.
>>>
>>> Hannes also suggested going back to making BFQ a separate scheduler
>>> rather than replacing CFQ earlier, pointing out that it mitigates
>>> against the risks of changing CFQ substantially at this point (which
>>> seems to be the biggest issue here).
>>>
>> ISTR that the original argument for this approach essentially amounted to: 'If it's so much better, why do we need both?'.
>>
>> Such an argument is valid only if the new design is better in all respects (which there isn't sufficient information to decide in this case), or the negative aspects are worth the improvements (which is too workload specific to decide for something like this).
>
> All correct, apart from the workload-specific issue, which is not very clear to me. Over the last five years I have not found a single workload for which CFQ is better than BFQ, and none has been suggested.
My point is that whether or not BFQ is better depends on the workload.
You can't test for every workload, so you can't say definitively that
BFQ is better for every workload. At a minimum, there are workloads
where the deadline and noop schedulers are better, but they're very
domain specific workloads. Based on the numbers from Shaohua, it looks
like CFQ has better throughput than BFQ, and that will affect some
workloads (for most, the improved fairness is worth the reduced
throughput, but there probably are some cases where it isn't).
>
> Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: "we want BFQ to replace CFQ, but, since CFQ is legacy code, then you cannot change, and thus replace, CFQ"
I agree that that's part of the issue, but I also don't entirely agree
with the reasoning on it. Until blk-mq has proper I/O scheduling,
people will continue to use CFQ, and based on the way things are going,
it will be multiple months before that happens, whereas BFQ exists and
is working now.

2016-10-06 15:06:09

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> Il giorno 06 ott 2016, alle ore 15:52, Austin S. Hemmelgarn <[email protected]> ha scritto:
>
> On 2016-10-06 08:50, Paolo Valente wrote:
>>
>>> Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn <[email protected]> ha scritto:
>>>
>>> On 2016-10-06 07:03, Mark Brown wrote:
>>>> On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
>>>>> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <[email protected]> wrote:
>>>>
>>>>>> I get that bfq can be a good compromise on most desktop workloads and
>>>>>> behave reasonably well for some server workloads with the slice
>>>>>> expiration mechanism but it really isn't an IO resource partitioning
>>>>>> mechanism.
>>>>
>>>>> Not just desktops, also Android phones.
>>>>
>>>>> So why not have BFQ as a separate scheduling policy upstream,
>>>>> alongside CFQ, deadline and noop?
>>>>
>>>> Right.
>>>>
>>>>> We're already doing the per-usecase Kconfig thing for preemption.
>>>>> But maybe somebody already hates that and want to get rid of it,
>>>>> I don't know.
>>>>
>>>> Hannes also suggested going back to making BFQ a separate scheduler
>>>> rather than replacing CFQ earlier, pointing out that it mitigates
>>>> against the risks of changing CFQ substantially at this point (which
>>>> seems to be the biggest issue here).
>>>>
>>> ISTR that the original argument for this approach essentially amounted to: 'If it's so much better, why do we need both?'.
>>>
>>> Such an argument is valid only if the new design is better in all respects (which there isn't sufficient information to decide in this case), or the negative aspects are worth the improvements (which is too workload specific to decide for something like this).
>>
>> All correct, apart from the workload-specific issue, which is not very clear to me. Over the last five years I have not found a single workload for which CFQ is better than BFQ, and none has been suggested.
> My point is that whether or not BFQ is better depends on the workload. You can't test for every workload, so you can't say definitively that BFQ is better for every workload.

Yes

> At a minimum, there are workloads where the deadline and noop schedulers are better, but they're very domain specific workloads.

Definitely

> Based on the numbers from Shaohua, it looks like CFQ has better throughput than BFQ, and that will affect some workloads (for most, the improved fairness is worth the reduced throughput, but there probably are some cases where it isn't).

Well, no fairness, just as with deadline and noop, but with much less
throughput than deadline and noop: that doesn't sound much like the
best scheduler for those workloads. With BFQ you get service
guarantees; with noop or deadline you get maximum throughput.

>>
>> Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: "we want BFQ to replace CFQ, but, since CFQ is legacy code, then you cannot change, and thus replace, CFQ"
> I agree that that's part of the issue, but I also don't entirely agree with the reasoning on it. Until blk-mq has proper I/O scheduling, people will continue to use CFQ, and based on the way things are going, it will be multiple months before that happens, whereas BFQ exists and is working now.

Exactly!

Thanks,
Paolo



--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-06 15:11:11

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On 2016-10-06 11:05, Paolo Valente wrote:
>
>> Il giorno 06 ott 2016, alle ore 15:52, Austin S. Hemmelgarn <[email protected]> ha scritto:
>>
>> On 2016-10-06 08:50, Paolo Valente wrote:
>>>
>>>> Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn <[email protected]> ha scritto:
>>>>
>>>> On 2016-10-06 07:03, Mark Brown wrote:
>>>>> On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:
>>>>>> On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <[email protected]> wrote:
>>>>>
>>>>>>> I get that bfq can be a good compromise on most desktop workloads and
>>>>>>> behave reasonably well for some server workloads with the slice
>>>>>>> expiration mechanism but it really isn't an IO resource partitioning
>>>>>>> mechanism.
>>>>>
>>>>>> Not just desktops, also Android phones.
>>>>>
>>>>>> So why not have BFQ as a separate scheduling policy upstream,
>>>>>> alongside CFQ, deadline and noop?
>>>>>
>>>>> Right.
>>>>>
>>>>>> We're already doing the per-usecase Kconfig thing for preemption.
>>>>>> But maybe somebody already hates that and want to get rid of it,
>>>>>> I don't know.
>>>>>
>>>>> Hannes also suggested going back to making BFQ a separate scheduler
>>>>> rather than replacing CFQ earlier, pointing out that it mitigates
>>>>> against the risks of changing CFQ substantially at this point (which
>>>>> seems to be the biggest issue here).
>>>>>
>>>> ISTR that the original argument for this approach essentially amounted to: 'If it's so much better, why do we need both?'.
>>>>
>>>> Such an argument is valid only if the new design is better in all respects (which there isn't sufficient information to decide in this case), or the negative aspects are worth the improvements (which is too workload specific to decide for something like this).
>>>
>>> All correct, apart from the workload-specific issue, which is not very clear to me. Over the last five years I have not found a single workload for which CFQ is better than BFQ, and none has been suggested.
>> My point is that whether or not BFQ is better depends on the workload. You can't test for every workload, so you can't say definitively that BFQ is better for every workload.
>
> Yes
>
>> At a minimum, there are workloads where the deadline and noop schedulers are better, but they're very domain specific workloads.
>
> Definitely
>
>> Based on the numbers from Shaohua, it looks like CFQ has better throughput than BFQ, and that will affect some workloads (for most, the improved fairness is worth the reduced throughput, but there probably are some cases where it isn't).
>
> Well, no fairness as deadline and noop, but with much less throughput
> than deadline and noop, doesn't sound much like the best scheduler for
> those workloads. With BFQ you have service guarantees, with noop or
> deadline you have maximum throughput.
And with CFQ you have something in between, which is half of why I think
CFQ is still worth keeping (the other half being the people who will
inevitably want to stay on CFQ). And TBH, deadline and noop only give
good throughput with specific workloads (and in the case of noop, it's
usually only useful on tiny systems where the overhead of scheduling is
greater than the time saved by doing so (like some very low-power
embedded systems), or when scheduling is done elsewhere in the storage
stack (like in a VM)).
>
>>>
>>> Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: "we want BFQ to replace CFQ, but, since CFQ is legacy code, then you cannot change, and thus replace, CFQ"
>> I agree that that's part of the issue, but I also don't entirely agree with the reasoning on it. Until blk-mq has proper I/O scheduling, people will continue to use CFQ, and based on the way things are going, it will be multiple months before that happens, whereas BFQ exists and is working now.
>
> Exactly!
>
> Thanks,
> Paolo
>

2016-10-06 17:49:52

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Thu, Oct 06, 2016 at 03:15:50PM +0200, Paolo Valente wrote:

[..]
> Shaohua, I have just realized that I have unconsciously defended a
> wrong argument. Although all the facts that I have reported are
> evidently true, I have argued as if the question was: "do we need to
> throw away throttling because there is proportional, or do we need to
> throw away proportional share because there is throttling?". This
> question is simply wrong, as I think consciously (sorry for my
> dissociated behavior :) ).

I was wondering the same thing. We need both, and both should be able
to work with today's fast devices using the blk-mq interfaces without
much overhead.

>
> The best goal to achieve is to have both a good throttling mechanism,
> and a good proportional share scheduler. This goal would be valid if
> even if there was just one important scenario for each of the two
> approaches. The vulnus here is that you guys are constantly, and
> rightly, working on solutions to achieve and consolidate reasonable
> QoS guarantees, but an apparently very good proportional-share
> scheduler has been kept off for years. If you (or others) have good
> arguments to support this state of affairs, then this would probably
> be an important point to discuss.

Paolo, CFQ is legacy now and if we can come up with a proportional
IO mechanism which works reasonably well with fast devices using
blk-mq interfaces, that will be much more interesting.

Vivek

2016-10-06 18:01:51

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> On 06 Oct 2016, at 19:49, Vivek Goyal <[email protected]> wrote:
>
> On Thu, Oct 06, 2016 at 03:15:50PM +0200, Paolo Valente wrote:
>
> [..]
>> Shaohua, I have just realized that I have unconsciously defended a
>> wrong argument. Although all the facts that I have reported are
>> evidently true, I have argued as if the question was: "do we need to
>> throw away throttling because there is proportional, or do we need to
>> throw away proportional share because there is throttling?". This
>> question is simply wrong, as I think consciously (sorry for my
>> dissociated behavior :) ).
>
> I was wondering about the same. We need both and both should be able
> to work with fast devices of today using blk-mq interfaces without
> much overhead.
>
>>
>> The best goal to achieve is to have both a good throttling mechanism,
>> and a good proportional share scheduler. This goal would be valid if
>> even if there was just one important scenario for each of the two
>> approaches. The vulnus here is that you guys are constantly, and
>> rightly, working on solutions to achieve and consolidate reasonable
>> QoS guarantees, but an apparently very good proportional-share
>> scheduler has been kept off for years. If you (or others) have good
>> arguments to support this state of affairs, then this would probably
>> be an important point to discuss.
>
> Paolo, CFQ is legacy now and if we can come up with a proportional
> IO mechanism which works reasonably well with fast devices using
> blk-mq interfaces, that will be much more interesting.
>

That's absolutely true. But why do we pretend not to know that, for
(at least) hundreds of thousands of users, Linux will keep giving bad
responsiveness, starvation, high latency and unfairness until blk is
no longer used (assuming that these problems will somehow disappear
with blk-mq)? Many of these users are fully aware of these
long-standing Linux problems. We could solve these problems by just
adding a scheduler that has already been adopted, and thus extensively
tested, by thousands of users. And more and more people are aware of
this fact too. Are we doing the right thing?

Thanks,
Paolo


> Vivek


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-06 18:32:20

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Thu, Oct 06, 2016 at 08:01:42PM +0200, Paolo Valente wrote:
>
> > On 06 Oct 2016, at 19:49, Vivek Goyal <[email protected]> wrote:
> >
> > On Thu, Oct 06, 2016 at 03:15:50PM +0200, Paolo Valente wrote:
> >
> > [..]
> >> Shaohua, I have just realized that I have unconsciously defended a
> >> wrong argument. Although all the facts that I have reported are
> >> evidently true, I have argued as if the question was: "do we need to
> >> throw away throttling because there is proportional, or do we need to
> >> throw away proportional share because there is throttling?". This
> >> question is simply wrong, as I think consciously (sorry for my
> >> dissociated behavior :) ).
> >
> > I was wondering about the same. We need both and both should be able
> > to work with fast devices of today using blk-mq interfaces without
> > much overhead.
> >
> >>
> >> The best goal to achieve is to have both a good throttling mechanism,
> >> and a good proportional share scheduler. This goal would be valid if
> >> even if there was just one important scenario for each of the two
> >> approaches. The vulnus here is that you guys are constantly, and
> >> rightly, working on solutions to achieve and consolidate reasonable
> >> QoS guarantees, but an apparently very good proportional-share
> >> scheduler has been kept off for years. If you (or others) have good
> >> arguments to support this state of affairs, then this would probably
> >> be an important point to discuss.
> >
> > Paolo, CFQ is legacy now and if we can come up with a proportional
> > IO mechanism which works reasonably well with fast devices using
> > blk-mq interfaces, that will be much more interesting.
> >
>
> That's absolutely true. But, why do we pretend not to know that, for
> (at least) hundreds of thousands of users Linux will go on giving bad
> responsiveness, starvation, high latency and unfairness, until blk
> will not be used any more (assuming that these problems will somehow
> disappear will blk-mq). Many of these users are fully aware of these
> Linux long-standing problems. We could solve these problems by just
> adding a scheduler that has already been adopted, and thus extensively
> tested, by thousands of users. And more and more people are aware of
> this fact too. Are we doing the right thing?

Hi Paolo,

People have been using CFQ for many years. I am not sure whether the
benefits offered by BFQ over CFQ are significant enough to justify
taking in completely new code and getting rid of CFQ, or whether the
benefits are significant enough that one feels like putting time and
effort into this and taking chances with new code.

At this point of time replacing CFQ with something better is not a
priority for me. But if something better and stable goes upstream, I
will gladly use it.

Vivek

2016-10-06 19:45:35

by Mark Brown

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Thu, Oct 06, 2016 at 01:49:43PM -0400, Vivek Goyal wrote:

> Paolo, CFQ is legacy now and if we can come up with a proportional
> IO mechanism which works reasonably well with fast devices using
> blk-mq interfaces, that will be much more interesting.

CFQ is not legacy yet. Until we've got scheduling in blk-mq and
converted all the existing users over to blk-mq it's still going
to be something that affects real, production systems that don't
have any other meaningful option.



2016-10-06 20:00:14

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Thu, Oct 06, 2016 at 09:58:44AM +0200, Paolo Valente wrote:
>
> > On 05 Oct 2016, at 22:46, Shaohua Li <[email protected]> wrote:
> >
> > On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
> >>
> >>> On 05 Oct 2016, at 20:30, Shaohua Li <[email protected]> wrote:
> >>>
> >>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
> >>>> Hello, Paolo.
> >>>>
> >>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
> >>>>> In this respect, for your generic, unpredictable scenario to make
> >>>>> sense, there must exist at least one real system that meets the
> >>>>> requirements of such a scenario. Or, if such a real system does not
> >>>>> yet exist, it must be possible to emulate it. If it is impossible to
> >>>>> achieve this last goal either, then I miss the usefulness
> >>>>> of looking for solutions for such a scenario.
> >>>>>
> >>>>> That said, let's define the instance(s) of the scenario that you find
> >>>>> most representative, and let's test BFQ on it/them. Numbers will give
> >>>>> us the answers. For example, what about all or part of the following
> >>>>> groups:
> >>>>> . one cyclically doing random I/O for some second and then sequential I/O
> >>>>> for the next seconds
> >>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
> >>>>> . one starting an application cyclically
> >>>>> . one playing back or streaming a movie
> >>>>>
> >>>>> For each group, we could then measure the time needed to complete each
> >>>>> phase of I/O in each cycle, plus the responsiveness in the group
> >>>>> starting an application, plus the frame drop in the group streaming
> >>>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
> >>>>> each group, plus, of course, the aggregate throughput of the whole
> >>>>> system. In particular we could compare results with throttling, BFQ,
> >>>>> and CFQ.
> >>>>>
> >>>>> Then we could write resulting numbers on the stone, and stick to them
> >>>>> until something proves them wrong.
> >>>>>
> >>>>> What do you (or others) think about it?
> >>>>
> >>>> That sounds great and yeah it's lame that we didn't start with that.
> >>>> Shaohua, would it be difficult to compare how bfq performs against
> >>>> blk-throttle?
> >>>
> >>> I had a test of BFQ.
> >>
> >> Thank you very much for testing BFQ!
> >>
> >>> I'm using BFQ found at
> >>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php . version is
> >>> 4.7.0-v8r3.
> >>
> >> That's the latest stable version. The development version [1] already
> >> contains further improvements for fairness, latency and throughput.
> >> It is however still a release candidate.
> >>
> >> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
> >>
> >>> It's a LSI SSD, queue depth 32. I use default setting. fio script
> >>> is:
> >>>
> >>> [global]
> >>> ioengine=libaio
> >>> direct=1
> >>> readwrite=randread
> >>> bs=4k
> >>> runtime=60
> >>> time_based=1
> >>> file_service_type=random:36
> >>> overwrite=1
> >>> thread=0
> >>> group_reporting=1
> >>> filename=/dev/sdb
> >>> iodepth=1
> >>> numjobs=8
> >>>
> >>> [groupA]
> >>> prio=2
> >>>
> >>> [groupB]
> >>> new_group
> >>> prio=6
> >>>
> >>> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
> >>>
> >>> iodepth=1 numjobs=1 prio 4:4
> >>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
> >>>
> >>> iodepth=8 numjobs=1 prio 4:4
> >>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
> >>>
> >>> iodepth=1 numjobs=8 prio 4:4
> >>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
> >>>
> >>> iodepth=1 numjobs=1 prio 2:6
> >>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
> >>>
> >>> iodepth=8 numjobs=1 prio 2:6
> >>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
> >>>
> >>> iodepth=1 numjobs=8 prio 2:6
> >>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
> >>>
> >>> CFQ isn't fair at all. BFQ is very good in this side, but has poor throughput
> >>> even prio is the default value.
> >>>
> >>
> >> Throughput is lower with BFQ for two reasons.
> >>
> >> First, you certainly left the low_latency in its default state, i.e.,
> >> on. As explained, e.g., here [2], low_latency mode is totally geared
> >> towards maximum responsiveness and minimum latency for soft real-time
> >> applications (e.g., video players). To achieve this goal, BFQ is
> >> willing to perform more idling, when necessary. This lowers
> >> throughput (I'll get back on this at the end of the discussion of the
> >> second reason).
> >
> > changing low_latency to 0 seems not change anything, at least for the test:
> > iodepth=1 numjobs=1 prio 2:6 A bs 4k:64k
> >
> >> The second, most important reason, is that a minimum of idling is the
> >> *only* way to achieve differentiated bandwidth distribution, as you
> >> requested by setting different ioprios. I stress that this constraint
> >> is not a technological accident, but a intrinsic, logical necessity.
> >> The proof is simple, and if the following explanation is too boring or
> >> confusing, I can show it to you with any trace of sync I/O.
> >>
> >> First, to provide differentiated service, you need per-process
> >> scheduling, i.e., schedulers in which there is a separate queue
> >> associated with each process. Now, let A be the process with higher
> >> weight (ioprio), and B the process with lower weight. Both processes
> >> are sync, thus, by definition, they issue requests as follows: a few
> >> requests (probably two, or a little bit more with larger iodepth),
> >> then a little break to wait for request completion, then the next
> >> small batch and so on. For each process, the queue associated with
> >> the process (in the scheduler) is necessarily empty on the break. As
> >> a consequence, if there is no idling, then every time A reaches its
> >> break, the scheduler has only the option to switch to B (which is
> >> extremely likely to have pending requests).
> >>
> >> The service pattern of the processes then unavoidably becomes:
> >>
> >> A B A B A B ...
> >>
> >> where each letter represents a full small batch served for the
> >> process. That is, 50% of the bw for each process, and complete loss
> >> of control on the desired bandwidth distribution.
> >>
> >> So, to sum up, the reason why BFQ achieves a lower total bw is that it
> >> behaves in the only correct way to respect weights with sync I/O,
> >> i.e., it performs a little idling. If low_latency is on, then BFQ
> >> increases idling further, and this may be have caused further bw loss
> >> in your test (but this varies greatly with devices, so you can
> >> discover it only by trying).
> >>
> >> The bottom line is that if you do want to achieve differentiation with
> >> sync I/O, you have to pay a price in terms of bw, because of idling.
> >> Actually, the recent preemption mechanism that I have introduced in
> >> BFQ is proving so effective in preserving differentiation, that I'm
> >> tempted to try some almost idleness solution. A little of accuracy
> >> should however be sacrificed. Anyway, this is still work in progress.
> >
> > Yep, I fully understand why idle is required here. As long as workload io depth
> > is lower than queue io depth, idle is the only way to maintain fairness. This
> > is the core of CFQ, I bet the same for BFQ. Unfortunately idle disk harms
> > throughput too much especially for high end SSD.
> >
>
> Then I'm afraid I have to give you very bad news: bw limiting causes
> the same throughput loss. You can see it from your very tests. Here
> is one of your results with BFQ (one that is likely to have been
> affected less by the fact that you left low_latency on, or by further
> issues that I may have not yet addressed thoroughly):
>
> iodepth=8 numjobs=1 prio 2:6
> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>
> Here is, instead, your test with bw limitation:
> iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
> CFQ:51:207 BFQ: 51:45 deadline: 51:216
>
> From the first test, you see that the total bw achievable by the
> device is at least 404MB/s. But in the second test you get at most
> 267MB/s, with deadline. In this respect, the total bw achieved by BFQ
> in the first test is 211MB/s.
>
> So, both throttling and proportional share need to waste bw, BFQ
> looses about 13% more of the total bw.

I don't think you calculated this correctly. At iodepth 8 and request size 4k,
the workload can only dispatch 216 MB/s. Even if group A doesn't dispatch any
IO, group B can only dispatch 216 MB/s. So deadline doesn't waste any bandwidth,
CFQ wastes 216 - 207, while BFQ wastes 216 - 45. That's the problem.
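
The arithmetic above can be restated in a few lines of Python. This is only an
illustrative restatement: the 216 MB/s figure is taken as the empirical ceiling
a single iodepth-8, 4k job reaches on this device (the deadline result quoted
above), not something derivable from iodepth and block size alone, and the
variable names are invented.

SINGLE_JOB_CEILING = 216          # MB/s group B reaches on its own (deadline result above)
group_b_with_limit = {            # group B throughput when group A is capped at 50 MB/s
    "deadline": 216,
    "CFQ": 207,
    "BFQ": 45,
}
for sched, bw in group_b_with_limit.items():
    print(f"{sched}: {SINGLE_JOB_CEILING - bw} MB/s of group B's ceiling wasted")
# deadline: 0 MB/s wasted, CFQ: 9 MB/s wasted, BFQ: 171 MB/s wasted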

> In return, it gives you
> incomparably better bw and latency guarantees, while allowing you to
> configure your system with zero or minimal effort. In contrast, using
> bw limits to properly configure a common system like, e.g., a large
> file server, may become a nightmare for a sysadmin. For example, if,
> in the simplest case, she/he configures limits for the worst-case,
> then per-client limits will have to be extremely low. But the system
> is large and dynamic, so the actual number of clients and at the
> actual bw consumed by each client will vary without a break, even in
> the short-medium term. The bw redistribution heuristic do not give
> any provable guarantees on accuracy of bw redistribution. The result
> will likely be highly varying client bandwidths, with unlucky clients
> unjustly limited to low limits, and then experiencing high latencies.
> The latter will be further emphasized by the intrinsically bursty
> nature of throttling.

I don't disagree here. The bw/iops throttling is not easy to configure; it's a
kind of low-level configuration, and people need to know the workload very well
to configure it. If we had a way to do proportional scheduling, everybody would
cheer up. The goal of the tests is to check whether proportional scheduling is
feasible, and the result isn't very optimistic so far. No, I'm not saying BFQ
isn't good. It is much fairer than CFQ. I suppose it would work well for desktop
workloads.

> In addition, the scenario in your tests is the worst case for a
> proportional share solution: in a generic system, such as the the file
> server above, part of the workload is likely to be sequential or
> quasi-sequential (at least in medium-length time intervals), and this
> is enough to get very close to peak bw with a proportional-share
> scheduler. No configuration needed. With bw throttling, you must do
> the math very well to get peak bw all the time in a dynamic system.

I'm afraid this is not true. Workloads seldom fully utilize the bandwidth of
a high-end SSD. At least that's true here. My tests are actually pretty normal.
I didn't do weird tests at all.

Thanks,
Shaohua
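
To make the idling argument quoted above (the A B A B service pattern) easier
to follow, here is a deliberately crude, hypothetical Python model. The service
and think times, the weighted-round-robin structure and the 2:1 weights are all
invented for illustration; real CFQ/BFQ behaviour is far more involved, and the
100% utilization reported for the no-idling case is an artifact of the toy model.

SERVICE_US = 100                 # invented device time per small sync batch
THINK_US = 50                    # invented gap before a process queues its next batch
WEIGHTS = {"A": 2, "B": 1}       # desired 2:1 bandwidth split

def simulate(idling, rounds=10_000):
    served = {"A": 0, "B": 0}
    busy_us = idle_us = 0
    for _ in range(rounds):
        for proc, weight in WEIGHTS.items():
            # without idling the scheduler cannot hold the slot across the
            # think time, so every sync process gets one batch per turn
            batches = weight if idling else 1
            for i in range(batches):
                if idling and i > 0:
                    idle_us += THINK_US   # wait for proc's next sync batch
                served[proc] += 1
                busy_us += SERVICE_US
    total = sum(served.values())
    share = {p: round(n / total, 2) for p, n in served.items()}
    return share, round(busy_us / (busy_us + idle_us), 2)

print("no idling:  ", simulate(idling=False))   # ({'A': 0.5, 'B': 0.5}, 1.0)
print("with idling:", simulate(idling=True))    # ({'A': 0.67, 'B': 0.33}, 0.86)

In this toy, dropping idling restores full device utilization but collapses the
intended 2:1 split to 50:50, which is exactly the trade-off being discussed.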

2016-10-06 20:51:47

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> On 06 Oct 2016, at 20:32, Vivek Goyal <[email protected]> wrote:
>
> On Thu, Oct 06, 2016 at 08:01:42PM +0200, Paolo Valente wrote:
>>
>>> On 06 Oct 2016, at 19:49, Vivek Goyal <[email protected]> wrote:
>>>
>>> On Thu, Oct 06, 2016 at 03:15:50PM +0200, Paolo Valente wrote:
>>>
>>> [..]
>>>> Shaohua, I have just realized that I have unconsciously defended a
>>>> wrong argument. Although all the facts that I have reported are
>>>> evidently true, I have argued as if the question was: "do we need to
>>>> throw away throttling because there is proportional, or do we need to
>>>> throw away proportional share because there is throttling?". This
>>>> question is simply wrong, as I think consciously (sorry for my
>>>> dissociated behavior :) ).
>>>
>>> I was wondering about the same. We need both and both should be able
>>> to work with fast devices of today using blk-mq interfaces without
>>> much overhead.
>>>
>>>>
>>>> The best goal to achieve is to have both a good throttling mechanism,
>>>> and a good proportional share scheduler. This goal would be valid if
>>>> even if there was just one important scenario for each of the two
>>>> approaches. The vulnus here is that you guys are constantly, and
>>>> rightly, working on solutions to achieve and consolidate reasonable
>>>> QoS guarantees, but an apparently very good proportional-share
>>>> scheduler has been kept off for years. If you (or others) have good
>>>> arguments to support this state of affairs, then this would probably
>>>> be an important point to discuss.
>>>
>>> Paolo, CFQ is legacy now and if we can come up with a proportional
>>> IO mechanism which works reasonably well with fast devices using
>>> blk-mq interfaces, that will be much more interesting.
>>>
>>
>> That's absolutely true. But, why do we pretend not to know that, for
>> (at least) hundreds of thousands of users Linux will go on giving bad
>> responsiveness, starvation, high latency and unfairness, until blk
>> will not be used any more (assuming that these problems will somehow
>> disappear will blk-mq). Many of these users are fully aware of these
>> Linux long-standing problems. We could solve these problems by just
>> adding a scheduler that has already been adopted, and thus extensively
>> tested, by thousands of users. And more and more people are aware of
>> this fact too. Are we doing the right thing?
>
> Hi Paolo,
>

Hi

> People have been using CFQ for many years.

Yes, but allow me just to add that a lot of people have also been
unhappy with CFQ for many years.

> I am not sure if benefits
> offered by BFQ over CFQ are significant enough to justify taking a
> completely new code and get rid of CFQ. Or are the benfits significant
> enough that one feels like putting time and effort into this and
> take chances wiht new code.
>

Although I think that BFQ's benefits are relevant (but then I'm a bit
of an interested party :) ), I do agree that abruptly replacing the
most widely used I/O scheduler (AFAIK) with such a different one is at
least a little risky.

> At this point of time replacing CFQ with something better is not a
> priority for me.

ok

> But if something better and stable goes upstream, I
> will gladly use it.
>

Then, in case of success, I will be glad to receive some feedback from
you, and possibly use it to improve the set of ideas that we have put
into BFQ.

Thank you,
Paolo

> Vivek


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-06 22:24:20

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> On 06 Oct 2016, at 21:57, Shaohua Li <[email protected]> wrote:
>
> On Thu, Oct 06, 2016 at 09:58:44AM +0200, Paolo Valente wrote:
>>
>>> On 05 Oct 2016, at 22:46, Shaohua Li <[email protected]> wrote:
>>>
>>> On Wed, Oct 05, 2016 at 09:47:19PM +0200, Paolo Valente wrote:
>>>>
>>>>> On 05 Oct 2016, at 20:30, Shaohua Li <[email protected]> wrote:
>>>>>
>>>>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>>>>> Hello, Paolo.
>>>>>>
>>>>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>>>>>> In this respect, for your generic, unpredictable scenario to make
>>>>>>> sense, there must exist at least one real system that meets the
>>>>>>> requirements of such a scenario. Or, if such a real system does not
>>>>>>> yet exist, it must be possible to emulate it. If it is impossible to
>>>>>>> achieve this last goal either, then I miss the usefulness
>>>>>>> of looking for solutions for such a scenario.
>>>>>>>
>>>>>>> That said, let's define the instance(s) of the scenario that you find
>>>>>>> most representative, and let's test BFQ on it/them. Numbers will give
>>>>>>> us the answers. For example, what about all or part of the following
>>>>>>> groups:
>>>>>>> . one cyclically doing random I/O for some second and then sequential I/O
>>>>>>> for the next seconds
>>>>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>>>>>> . one starting an application cyclically
>>>>>>> . one playing back or streaming a movie
>>>>>>>
>>>>>>> For each group, we could then measure the time needed to complete each
>>>>>>> phase of I/O in each cycle, plus the responsiveness in the group
>>>>>>> starting an application, plus the frame drop in the group streaming
>>>>>>> the movie. In addition, we can measure the bandwidth/iops enjoyed by
>>>>>>> each group, plus, of course, the aggregate throughput of the whole
>>>>>>> system. In particular we could compare results with throttling, BFQ,
>>>>>>> and CFQ.
>>>>>>>
>>>>>>> Then we could write resulting numbers on the stone, and stick to them
>>>>>>> until something proves them wrong.
>>>>>>>
>>>>>>> What do you (or others) think about it?
>>>>>>
>>>>>> That sounds great and yeah it's lame that we didn't start with that.
>>>>>> Shaohua, would it be difficult to compare how bfq performs against
>>>>>> blk-throttle?
>>>>>
>>>>> I had a test of BFQ.
>>>>
>>>> Thank you very much for testing BFQ!
>>>>
>>>>> I'm using BFQ found at
>>>>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php . version is
>>>>> 4.7.0-v8r3.
>>>>
>>>> That's the latest stable version. The development version [1] already
>>>> contains further improvements for fairness, latency and throughput.
>>>> It is however still a release candidate.
>>>>
>>>> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>>>>
>>>>> It's a LSI SSD, queue depth 32. I use default setting. fio script
>>>>> is:
>>>>>
>>>>> [global]
>>>>> ioengine=libaio
>>>>> direct=1
>>>>> readwrite=randread
>>>>> bs=4k
>>>>> runtime=60
>>>>> time_based=1
>>>>> file_service_type=random:36
>>>>> overwrite=1
>>>>> thread=0
>>>>> group_reporting=1
>>>>> filename=/dev/sdb
>>>>> iodepth=1
>>>>> numjobs=8
>>>>>
>>>>> [groupA]
>>>>> prio=2
>>>>>
>>>>> [groupB]
>>>>> new_group
>>>>> prio=6
>>>>>
>>>>> I'll change iodepth, numjobs and prio in different tests. result unit is MB/s.
>>>>>
>>>>> iodepth=1 numjobs=1 prio 4:4
>>>>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>>>>>
>>>>> iodepth=8 numjobs=1 prio 4:4
>>>>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>>>>>
>>>>> iodepth=1 numjobs=8 prio 4:4
>>>>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>>>>>
>>>>> iodepth=1 numjobs=1 prio 2:6
>>>>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>>>>>
>>>>> iodepth=8 numjobs=1 prio 2:6
>>>>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>>>>>
>>>>> iodepth=1 numjobs=8 prio 2:6
>>>>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>>>>>
>>>>> CFQ isn't fair at all. BFQ is very good in this side, but has poor throughput
>>>>> even prio is the default value.
>>>>>
>>>>
>>>> Throughput is lower with BFQ for two reasons.
>>>>
>>>> First, you certainly left the low_latency in its default state, i.e.,
>>>> on. As explained, e.g., here [2], low_latency mode is totally geared
>>>> towards maximum responsiveness and minimum latency for soft real-time
>>>> applications (e.g., video players). To achieve this goal, BFQ is
>>>> willing to perform more idling, when necessary. This lowers
>>>> throughput (I'll get back on this at the end of the discussion of the
>>>> second reason).
>>>
>>> changing low_latency to 0 seems not change anything, at least for the test:
>>> iodepth=1 numjobs=1 prio 2:6 A bs 4k:64k
>>>
>>>> The second, most important reason, is that a minimum of idling is the
>>>> *only* way to achieve differentiated bandwidth distribution, as you
>>>> requested by setting different ioprios. I stress that this constraint
>>>> is not a technological accident, but a intrinsic, logical necessity.
>>>> The proof is simple, and if the following explanation is too boring or
>>>> confusing, I can show it to you with any trace of sync I/O.
>>>>
>>>> First, to provide differentiated service, you need per-process
>>>> scheduling, i.e., schedulers in which there is a separate queue
>>>> associated with each process. Now, let A be the process with higher
>>>> weight (ioprio), and B the process with lower weight. Both processes
>>>> are sync, thus, by definition, they issue requests as follows: a few
>>>> requests (probably two, or a little bit more with larger iodepth),
>>>> then a little break to wait for request completion, then the next
>>>> small batch and so on. For each process, the queue associated with
>>>> the process (in the scheduler) is necessarily empty on the break. As
>>>> a consequence, if there is no idling, then every time A reaches its
>>>> break, the scheduler has only the option to switch to B (which is
>>>> extremely likely to have pending requests).
>>>>
>>>> The service pattern of the processes then unavoidably becomes:
>>>>
>>>> A B A B A B ...
>>>>
>>>> where each letter represents a full small batch served for the
>>>> process. That is, 50% of the bw for each process, and complete loss
>>>> of control on the desired bandwidth distribution.
>>>>
>>>> So, to sum up, the reason why BFQ achieves a lower total bw is that it
>>>> behaves in the only correct way to respect weights with sync I/O,
>>>> i.e., it performs a little idling. If low_latency is on, then BFQ
>>>> increases idling further, and this may be have caused further bw loss
>>>> in your test (but this varies greatly with devices, so you can
>>>> discover it only by trying).
>>>>
>>>> The bottom line is that if you do want to achieve differentiation with
>>>> sync I/O, you have to pay a price in terms of bw, because of idling.
>>>> Actually, the recent preemption mechanism that I have introduced in
>>>> BFQ is proving so effective in preserving differentiation, that I'm
>>>> tempted to try some almost idleness solution. A little of accuracy
>>>> should however be sacrificed. Anyway, this is still work in progress.
>>>
>>> Yep, I fully understand why idle is required here. As long as workload io depth
>>> is lower than queue io depth, idle is the only way to maintain fairness. This
>>> is the core of CFQ, I bet the same for BFQ. Unfortunately idle disk harms
>>> throughput too much especially for high end SSD.
>>>
>>
>> Then I'm afraid I have to give you very bad news: bw limiting causes
>> the same throughput loss. You can see it from your very tests. Here
>> is one of your results with BFQ (one that is likely to have been
>> affected less by the fact that you left low_latency on, or by further
>> issues that I may have not yet addressed thoroughly):
>>
>> iodepth=8 numjobs=1 prio 2:6
>> CFQ: 166:174 BFQ: 139:72 deadline: 202:202
>>
>> Here is, instead, your test with bw limitation:
>> iodepth=8 numjobs=1 prio 2:6, group A has 50M/s limit
>> CFQ:51:207 BFQ: 51:45 deadline: 51:216
>>
>> From the first test, you see that the total bw achievable by the
>> device is at least 404MB/s. But in the second test you get at most
>> 267MB/s, with deadline. In this respect, the total bw achieved by BFQ
>> in the first test is 211MB/s.
>>
>> So, both throttling and proportional share need to waste bw, BFQ
>> looses about 13% more of the total bw.
>
> I don't think you calculate this correct. In iodepth 8 request size 4k, the
> workload can only dispatch 216M/s. Even group A doesn't dispatch any IO, group
> B can only dispatch 216M/s. So deadline doesn't waste any bw. CFQ wastes 216 -
> 207, while BFQ wastes 216 - 45. That's the problem.
>

I'm sorry, I chose an ambiguous example to show my point. I
considered it too superficially, because it was very simple and
provided numbers on which we would both have agreed. Yes, group B
doesn't lose anything with deadline, no matter what limit we impose on
group A. That's why this is a case where one would never apply any bw
limitation.

One needs bw limitation when some group is choked by other groups.
For this to happen, some of the groups causing the problem must each
be receiving less than the bw they could sustain on their own. In this
situation, the total workload is pumping the device to the maximum
total bw achievable given the properties of each group's workload
(iodepth, block size, degree of randomness, ...). If one intervenes
with group limits, then she/he must be quite clever in finding the right
limits for each offending group, such that a total bw close to the
no-limit case is still achieved.

If the system is dynamic, i.e., the number of groups, as well as group
iodepths, block sizes and randomness, may vary with time, then she/he
must be very lucky too, as she/he must find an automatic
limit-recomputation scheme that still achieves close to maximum bw
however groups and workloads change. IMO, depending on the system,
achieving such a goal may become virtually impossible.
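
A toy numeric example of the configuration problem described above, with every
number and group name invented: static per-group limits sized for a worst case
of four equally important clients leave half the device unused as soon as only
two of them are actually doing I/O, unless someone recomputes the limits in time.

DEVICE_BW = 400   # invented MB/s the device sustains for this workload mix

def served_bw(limits, demands):
    """Each group gets min(limit, demand), capped so the total never
    exceeds the device capacity (greedy order, for simplicity)."""
    total, out = 0, {}
    for group, demand in demands.items():
        grant = min(limits.get(group, DEVICE_BW), demand, DEVICE_BW - total)
        out[group] = grant
        total += grant
    return out

# limits sized for the worst case: four equally important clients
static_limits = {g: DEVICE_BW // 4 for g in ("g1", "g2", "g3", "g4")}

# ... but right now only two clients are actually issuing I/O
current_demand = {"g1": 300, "g2": 250, "g3": 0, "g4": 0}

granted = served_bw(static_limits, current_demand)
print(granted)                                       # {'g1': 100, 'g2': 100, 'g3': 0, 'g4': 0}
print("unused:", DEVICE_BW - sum(granted.values()))  # 200 MB/s left unused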

>> In return, it gives you
>> incomparably better bw and latency guarantees, while allowing you to
>> configure your system with zero or minimal effort. In contrast, using
>> bw limits to properly configure a common system like, e.g., a large
>> file server, may become a nightmare for a sysadmin. For example, if,
>> in the simplest case, she/he configures limits for the worst-case,
>> then per-client limits will have to be extremely low. But the system
>> is large and dynamic, so the actual number of clients and at the
>> actual bw consumed by each client will vary without a break, even in
>> the short-medium term. The bw redistribution heuristic do not give
>> any provable guarantees on accuracy of bw redistribution. The result
>> will likely be highly varying client bandwidths, with unlucky clients
>> unjustly limited to low limits, and then experiencing high latencies.
>> The latter will be further emphasized by the intrinsically bursty
>> nature of throttling.
>
> I don't disagree here. The bw/iops throttling is not easy to configure. It's
> kind of low level configuration, people need to know the workload very well to
> configure. If we have way to do proportional scheduling, everybody will cheer
> up. The goal of the tests is checking if proportional scheduling is feasible,
> the result isn't very optimistic so far.

If you think that, in contrast, with bw limits you easily achieve
close to the maximum total bw, then your conclusion is correct. Above
I have tried to restate why this seems quite unlikely to me on
moderately complex, possibly dynamic systems.

> No, I don't say BFQ isn't good. It is
> much more fair than CFQ. I suppose it would work well for desktop workloads.
>> In addition, the scenario in your tests is the worst case for a
>> proportional share solution: in a generic system, such as the the file
>> server above, part of the workload is likely to be sequential or
>> quasi-sequential (at least in medium-length time intervals), and this
>> is enough to get very close to peak bw with a proportional-share
>> scheduler. No configuration needed. With bw throttling, you must do
>> the math very well to get peak bw all the time in a dynamic system.
>
> I'm afraid this is not true. Workloads seldomly fully utilize the bandwidth of
> a highend SSD.

Achieving full utilization in a complex system is certainly a rather
unlikely event. The question here is how much more you may lose when
you have to provide bw or other service guarantees to individual
users or groups. In this respect, I'm unfortunately too ignorant to
have an idea of the global distribution of significant workloads.

For this reason, in a case like this, I try to proceed with concrete
examples, to avoid hard-to-refute statements or positions. That is
why I made the file-server example. In that example, the workload is
not bound to be fully random. How relevant is this case? I don't
know. To me, it seems relevant. We could list other examples in which
the workload is mostly sequential or quasi-sequential (video streaming,
audio streaming, packet-traffic dumping, ...). In all these cases, it
seems to me that there may be bw to lose if the per-group bw
control mechanism is not well tuned.

> At least that's true here. My tests are actually pretty normal.
> I didn't do weird tests at all.
>

I don't think your tests are weird at all! They simply focus on one,
albeit very important, case. Is this the only significant case? My
examples above lead me to think that it is not. That's all I'm saying.

Thanks,
Paolo

> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/





2016-10-08 10:46:44

by Heinz Diehl

[permalink] [raw]
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On 06.10.2016, Linus Walleij wrote:

> Not just desktops, also Android phones.

> So why not have BFQ as a separate scheduling policy upstream,
> alongside CFQ, deadline and noop?

Please, just do it! CFQ and deadline perform horribly on Android, and
the same is true for desktop systems based on SSD drives.

What e.g. this test shows is true for all of my machines:
https://www.youtube.com/watch?v=1cjZeaCXIyM

I've been using BFQ almost from its inception, and it improves all of
my systems so noticeably that I don't update to a newer kernel unless
there is a BFQ patch ready for it.