From: Jinke Han <[email protected]>
Currently, blk-throtl does not distinguish sync write ios from normal
buffered write ios. A bio tagged with REQ_SYNC usually means the submitter
will wait for its completion soon after submitting it, so it is reasonable
for sync io to complete as soon as possible.

In our test, fio writes a 100g file sequentially with a 4k block size in
a container with a low bps limit configured (wbps=10M). More than 1200
ios were throttled in the blk-throtl queue and the average throttle time
of each io was 140s. At the same time, saving a small file with vim was
blocked for almost 140s: vim issues an fsync, and the sync ios of the
fsync are blocked behind the huge amount of buffered write ios queued
ahead of them. This is also a priority inversion problem within one
cgroup. In database scenarios, things get really bad with blk-throtl
enabled, as fsync is called very often.
This patch splits the bio queue into sync and async queues for blk-throtl
and gives sync write ios a much higher dispatch priority. The sync queue
only makes sense for write ios, as all read ios are treated as sync io;
I think this is a good match for the semantics of REQ_SYNC. Bios with
REQ_META and REQ_PRIO get the same priority, as they are important to the
fs. This may avoid some potential priority inversion problems.

With this patch, running the same test as above, the duration of the
fsync issued by vim drops to several hundred milliseconds.
Signed-off-by: Jinke Han <[email protected]>
Signed-off-by: Xin Yin <[email protected]>
---
Changes in v2
- Make code more simple.
Changes in v3
- Fix mismatch of waiting bio and the next dispatched bio.
- Rename dispatch_sync_cnt to disp_sync_cnt.
- Add more notes.
block/blk-throttle.c | 135 ++++++++++++++++++++++++++++++++++++++++---
block/blk-throttle.h | 13 ++++-
2 files changed, 139 insertions(+), 9 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 847721dc2b2b..c4db7873dfa8 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,13 @@
/* Total max dispatch from all groups in one round */
#define THROTL_QUANTUM 32
+/* For write ios, dispatch 4 sync ios and 1 normal io in one loop */
+#define THROTL_SYNC_FACTOR 4
+
+/* Only makes sense for write ios; all read ios are treated as SYNC */
+#define SYNC 0
+#define ASYNC 1
+
/* Throttling is performed over a slice and after that slice is renewed */
#define DFL_THROTL_SLICE_HD (HZ / 10)
#define DFL_THROTL_SLICE_SSD (HZ / 50)
@@ -88,6 +95,7 @@ struct throtl_data
};
static void throtl_pending_timer_fn(struct timer_list *t);
+static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn);
static inline struct blkcg_gq *tg_to_blkg(struct throtl_grp *tg)
{
@@ -241,11 +249,28 @@ static inline unsigned int throtl_bio_data_size(struct bio *bio)
return bio->bi_iter.bi_size;
}
-static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
+static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg,
+ bool rw)
{
INIT_LIST_HEAD(&qn->node);
- bio_list_init(&qn->bios);
+ bio_list_init(&qn->bios[SYNC]);
+ bio_list_init(&qn->bios[ASYNC]);
qn->tg = tg;
+ qn->disp_sync_cnt = (rw == READ) ? UINT_MAX : 0;
+ qn->next_to_disp = NULL;
+}
+
+#define BLK_THROTL_SYNC(bio) ((bio)->bi_opf & (REQ_SYNC | REQ_META | REQ_PRIO))
+
+static inline void throtl_qnode_add_bio_list(struct throtl_qnode *qn,
+ struct bio *bio)
+{
+ bool rw = bio_data_dir(bio);
+
+ if ((rw == READ) || BLK_THROTL_SYNC(bio))
+ bio_list_add(&qn->bios[SYNC], bio);
+ else
+ bio_list_add(&qn->bios[ASYNC], bio);
}
/**
@@ -261,13 +286,45 @@ static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
static void throtl_qnode_add_bio(struct bio *bio, struct throtl_qnode *qn,
struct list_head *queued)
{
- bio_list_add(&qn->bios, bio);
+ throtl_qnode_add_bio_list(qn, bio);
if (list_empty(&qn->node)) {
list_add_tail(&qn->node, queued);
blkg_get(tg_to_blkg(qn->tg));
}
}
+/**
+ * throtl_qnode_bio_peek - peek a bio for a qn
+ * @qn: the qnode to peek from
+ *
+ * For a read qn, just peek a bio from the SYNC queue and return it.
+ * For a write qn, first check next_to_disp and pop a bio from the
+ * queues to fill it if it is NULL. next_to_disp pins the bio to be
+ * dispatched next. This is necessary: during dispatch, a peeked bio
+ * may not be dispatchable due to lack of budget and has to wait, so
+ * the dispatching process may give up and release the request
+ * queue's spin lock; new bios may then be queued in. When it is time
+ * to dispatch the waiting bio, a different bio might otherwise be
+ * selected for the limit check and dispatched. If that dispatched
+ * bio is smaller than the waiting bio, the configured bandwidth may
+ * be hard to reach, as the slice may be trimmed after each dispatch.
+ * Pinning next_to_disp ensures that the waiting bio and the bio
+ * dispatched later are always the same one, even when the queue's
+ * spin lock is released and re-acquired.
+ */
+static inline struct bio *throtl_qnode_bio_peek(struct throtl_qnode *qn)
+{
+ /* qn for read ios */
+ if (qn->disp_sync_cnt == UINT_MAX)
+ return bio_list_peek(&qn->bios[SYNC]);
+
+ /* qn for write ios */
+ if (!qn->next_to_disp)
+ qn->next_to_disp = throtl_qnode_bio_list_pop(qn);
+
+ return qn->next_to_disp;
+}
+
/**
* throtl_peek_queued - peek the first bio on a qnode list
* @queued: the qnode list to peek
@@ -281,11 +338,73 @@ static struct bio *throtl_peek_queued(struct list_head *queued)
return NULL;
qn = list_first_entry(queued, struct throtl_qnode, node);
- bio = bio_list_peek(&qn->bios);
+ bio = throtl_qnode_bio_peek(qn);
WARN_ON_ONCE(!bio);
return bio;
}
+/**
+ * throtl_qnode_bio_list_pop - pop a bio from the sync/async queues
+ * @qn: the qnode to pop a bio from
+ *
+ * Only called for write qns. The target queue is chosen by disp_sync_cnt:
+ * SYNC until THROTL_SYNC_FACTOR sync bios have been dispatched in a row,
+ * then ASYNC once. If the target queue is empty, pop from the other one.
+ */
+static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn)
+{
+ struct bio *bio;
+ int from = SYNC;
+
+ if (qn->disp_sync_cnt == THROTL_SYNC_FACTOR)
+ from = ASYNC;
+
+ bio = bio_list_pop(&qn->bios[from]);
+ if (!bio) {
+ from = 1 - from;
+ bio = bio_list_pop(&qn->bios[from]);
+ }
+
+ if ((qn->disp_sync_cnt < THROTL_SYNC_FACTOR) &&
+ (from == SYNC))
+ qn->disp_sync_cnt++;
+ else
+ qn->disp_sync_cnt = 0;
+
+ return bio;
+}
+
+/**
+ * throtl_qnode_bio_pop - pop a bio from a qnode
+ * @qn: the qnode to pop a bio from
+ */
+static inline struct bio *throtl_qnode_bio_pop(struct throtl_qnode *qn)
+{
+ struct bio *bio;
+
+ /* qn for read ios */
+ if (qn->disp_sync_cnt == UINT_MAX)
+ return bio_list_pop(&qn->bios[SYNC]);
+
+ /* qn for write ios */
+ if (qn->next_to_disp) {
+ bio = qn->next_to_disp;
+ qn->next_to_disp = NULL;
+ return bio;
+ }
+
+ return throtl_qnode_bio_list_pop(qn);
+}
+
+static inline bool throtl_qnode_empty(struct throtl_qnode *qn)
+{
+ if (!qn->next_to_disp &&
+ bio_list_empty(&qn->bios[SYNC]) &&
+ bio_list_empty(&qn->bios[ASYNC]))
+ return true;
+ return false;
+}
+
/**
* throtl_pop_queued - pop the first bio form a qnode list
* @queued: the qnode list to pop a bio from
@@ -310,10 +429,10 @@ static struct bio *throtl_pop_queued(struct list_head *queued,
return NULL;
qn = list_first_entry(queued, struct throtl_qnode, node);
- bio = bio_list_pop(&qn->bios);
+ bio = throtl_qnode_bio_pop(qn);
WARN_ON_ONCE(!bio);
- if (bio_list_empty(&qn->bios)) {
+ if (throtl_qnode_empty(qn)) {
list_del_init(&qn->node);
if (tg_to_put)
*tg_to_put = qn->tg;
@@ -355,8 +474,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp,
throtl_service_queue_init(&tg->service_queue);
for (rw = READ; rw <= WRITE; rw++) {
- throtl_qnode_init(&tg->qnode_on_self[rw], tg);
- throtl_qnode_init(&tg->qnode_on_parent[rw], tg);
+ throtl_qnode_init(&tg->qnode_on_self[rw], tg, rw);
+ throtl_qnode_init(&tg->qnode_on_parent[rw], tg, rw);
}
RB_CLEAR_NODE(&tg->rb_node);
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index ef4b7a4de987..bd4fff65b6d9 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -28,8 +28,19 @@
*/
struct throtl_qnode {
struct list_head node; /* service_queue->queued[] */
- struct bio_list bios; /* queued bios */
+ struct bio_list bios[2]; /* queued bios */
struct throtl_grp *tg; /* tg this qnode belongs to */
+
+ struct bio *next_to_disp; /* pinned for next to dispatch */
+ /*
+ * 1) for write throtl_qnode:
+ * [0, THROTL_SYNC_FACTOR-1]: dispatch sync io
+ * [THROTL_SYNC_FACTOR]: dispatch async io
+ *
+ * 2) for read throtl_qnode:
+ * UINT_MAX
+ */
+ unsigned int disp_sync_cnt; /* sync io dispatch counter */
};
struct throtl_service_queue {
--
2.20.1
Hi Jinke,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.2-rc1 next-20221226]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Jinke-Han/blk-throtl-Introduce-sync-and-async-queues-for-blk-throtl/20221226-210601
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20221226130505.7186-1-hanjinke.666%40bytedance.com
patch subject: [PATCH v3] blk-throtl: Introduce sync and async queues for blk-throtl
config: x86_64-rhel-8.3-func
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/bb8c9fad48016fc6e9e63ac695f445b6814c4703
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Jinke-Han/blk-throtl-Introduce-sync-and-async-queues-for-blk-throtl/20221226-210601
git checkout bb8c9fad48016fc6e9e63ac695f445b6814c4703
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> block/blk-throttle.c:355: warning: expecting prototype for throtl_qnode_bio_pop(). Prototype was for throtl_qnode_bio_list_pop() instead
vim +355 block/blk-throttle.c
345
346 /**
347 * throtl_qnode_bio_pop: pop a bio from sync/async queue
348 * @qn: the qnode to pop a bio from
349 *
350 * For write io qn, the target queue to pop was determined by the disp_sync_cnt.
351 * Try to pop bio from target queue, fetch the bio and return it when it is not
352 * empty. If the target queue empty, pop bio from another queue instead.
353 */
354 static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn)
> 355 {
356 struct bio *bio;
357 int from = SYNC;
358
359 if (qn->disp_sync_cnt == THROTL_SYNC_FACTOR)
360 from = ASYNC;
361
362 bio = bio_list_pop(&qn->bios[from]);
363 if (!bio) {
364 from = 1 - from;
365 bio = bio_list_pop(&qn->bios[from]);
366 }
367
368 if ((qn->disp_sync_cnt < THROTL_SYNC_FACTOR) &&
369 (from == SYNC))
370 qn->disp_sync_cnt++;
371 else
372 qn->disp_sync_cnt = 0;
373
374 return bio;
375 }
376
--
0-DAY CI Kernel Test Service
https://01.org/lkp
Hello,
On Mon, Dec 26, 2022 at 09:05:05PM +0800, Jinke Han wrote:
> static void throtl_pending_timer_fn(struct timer_list *t);
> +static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn);
Just define it before the first usage? Also, I think it'd be fine to let the
compiler decide whether to inline.
> +#define BLK_THROTL_SYNC(bio) (bio->bi_opf & (REQ_SYNC | REQ_META | REQ_PRIO))
Nitpick but the above is used only in one place. Does it help to define it
as a macro?
> +/**
> + * throtl_qnode_bio_peek - peek a bio for a qn
> + * @qn: the qnode to peek from
> + *
> + * For read qn, just peek bio from the SYNC queue and return.
> + * For write qn, we first ask the next_to_disp for bio and will pop a bio
> + * to fill it if it's NULL. The next_to_disp is used to pin the bio for
> + * next to dispatch. It is necessary. In the dispatching process, a peeked
> + * bio may can't be dispatched due to lack of budget and has to wait, the
> + * dispatching process may give up and the spin lock of the request queue
> + * will be released. New bio may be queued in as the spin lock were released.
> + * When it's time to dispatch the waiting bio, another bio may be selected to
> + * check the limit and may be dispatched. If the dispatched bio is smaller
> + * than the waiting bio, the bandwidth may be hard to satisfied as we may
> + * trim the slice after each dispatch.
> + * So pinning the next_to_disp to make sure that the waiting bio and the
> + * dispatched one later always the same one in case that the spin lock of
> + * queue was released and re-holded.
Can you please format it better and proof-read it. I can mostly understand
what it's saying but it can be improved quite a bit. Can you elaborate the
starvation scenario further? What about the [a]sync queue split makes this
more likely?
> +/**
> + * throtl_qnode_bio_pop: pop a bio from sync/async queue
> + * @qn: the qnode to pop a bio from
> + *
> + * For write io qn, the target queue to pop was determined by the disp_sync_cnt.
> + * Try to pop bio from target queue, fetch the bio and return it when it is not
> + * empty. If the target queue empty, pop bio from another queue instead.
How about:
For reads, always pop from the SYNC queue. For writes, target SYNC
or ASYNC queue based on disp_sync_cnt. If empty, try the other
queue.
> +static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn)
> +{
> + struct bio *bio;
> + int from = SYNC;
> +
> + if (qn->disp_sync_cnt == THROTL_SYNC_FACTOR)
> + from = ASYNC;
?: often is less readable but I wonder whether it'd be more readable here:
from = qn->disp_sync_cnt == THROTL_SYNC_FACTOR ? ASYNC : SYNC;
> +
> + bio = bio_list_pop(&qn->bios[from]);
> + if (!bio) {
> + from = 1 - from;
> + bio = bio_list_pop(&qn->bios[from]);
> + }
> +
> + if ((qn->disp_sync_cnt < THROTL_SYNC_FACTOR) &&
> + (from == SYNC))
Why the line break? Also, this may be more personal preference but I'm not
sure the parentheses are helping much here.
> + qn->disp_sync_cnt++;
> + else
> + qn->disp_sync_cnt = 0;
> +
> + return bio;
> +}
Thanks.
--
tejun
On 2023/1/5 6:11 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Dec 26, 2022 at 09:05:05PM +0800, Jinke Han wrote:
>> static void throtl_pending_timer_fn(struct timer_list *t);
>> +static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn);
>
> Just define it before the first usage? Also, I think it'd be fine to let the
> compiler decide whether to inline.
>
>> +#define BLK_THROTL_SYNC(bio) (bio->bi_opf & (REQ_SYNC | REQ_META | REQ_PRIO))
>
> Nitpick but the above is used only in one place. Does it help to define it
> as a macro?
>
>> +/**
>> + * throtl_qnode_bio_peek - peek a bio for a qn
>> + * @qn: the qnode to peek from
>> + *
>> + * For read qn, just peek bio from the SYNC queue and return.
>> + * For write qn, we first ask the next_to_disp for bio and will pop a bio
>> + * to fill it if it's NULL. The next_to_disp is used to pin the bio for
>> + * next to dispatch. It is necessary. In the dispatching process, a peeked
>> + * bio may can't be dispatched due to lack of budget and has to wait, the
>> + * dispatching process may give up and the spin lock of the request queue
>> + * will be released. New bio may be queued in as the spin lock were released.
>> + * When it's time to dispatch the waiting bio, another bio may be selected to
>> + * check the limit and may be dispatched. If the dispatched bio is smaller
>> + * than the waiting bio, the bandwidth may be hard to satisfied as we may
>> + * trim the slice after each dispatch.
>> + * So pinning the next_to_disp to make sure that the waiting bio and the
>> + * dispatched one later always the same one in case that the spin lock of
>> + * queue was released and re-holded.
>
> Can you please format it better and proof-read it. I can mostly understand
> what it's saying but it can be improved quite a bit. Can you elaborate the
> starvation scenario further? What about the [a]sync queue split makes this
> more likely?
>
>> +/**
>> + * throtl_qnode_bio_pop: pop a bio from sync/async queue
>> + * @qn: the qnode to pop a bio from
>> + *
>> + * For write io qn, the target queue to pop was determined by the disp_sync_cnt.
>> + * Try to pop bio from target queue, fetch the bio and return it when it is not
>> + * empty. If the target queue empty, pop bio from another queue instead.
>
> How about:
>
> For reads, always pop from the ASYNC queue. For writes, target SYNC
> or ASYNC queue based on disp_sync_cnt. If empty, try the other
> queue.
>
>> +static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn)
>> +{
>> + struct bio *bio;
>> + int from = SYNC;
>> +
>> + if (qn->disp_sync_cnt == THROTL_SYNC_FACTOR)
>> + from = ASYNC;
>
> ?: often is less readable but I wonder whether it'd be more readable here:
>
> from = qn->disp_sync_cnt == THROTL_SYNC_FACTOR ? ASYNC : SYNC;
>
>> +
>> + bio = bio_list_pop(&qn->bios[from]);
>> + if (!bio) {
>> + from = 1 - from;
>> + bio = bio_list_pop(&qn->bios[from]);
>> + }
>> +
>> + if ((qn->disp_sync_cnt < THROTL_SYNC_FACTOR) &&
>> + (from == SYNC))
>
> Why the line break? Also, this may be more personal preference but I'm not
> sure the parentheses are helping much here.
>
>> + qn->disp_sync_cnt++;
>> + else
>> + qn->disp_sync_cnt = 0;
>> +
>> + return bio;
>> +}
>
> Thanks.
>
Thanks. Your suggestion is detailed and helpful. I will accept it and
send the v4.
Thanks.
Hello Jinke.
On Mon, Dec 26, 2022 at 09:05:05PM +0800, Jinke Han <[email protected]> wrote:
> In our test, fio writes a 100g file in sequential 4k blocksize in
> a container with low bps limit configured (wbps=10M).
> [...]
> At the same time, the operation of saving a small file by vim will be
> blocked amolst 140s.
Could you please elaborate on why this is specific to blk-throtl?
I guess similar problem would arise for devices that are "naturally"
slow.
Then:
a) it must have been solved elsewhere in the block layer (but it's
broken),
b) it should be solved generically in the block layer (thus this is only
a partial solution).
Alternatively, I imagine your problem could be reduced with
corresponding memory limits on IO-constrained cgroups. (The memory limit
would increase cgwb's dirty throttling and consequently leave more
IO bandwidth for sync IOs.)
Could you describe how the submitted solution compares to memory
limiting?
> This patch splits bio queue into sync and async queues for blk-throtl
> and gives a huge priority to sync write ios.
The "huge priority" corresponds to the THROTL_SYNC_FACTOR, right?
I'm slightly concerned about the introduction of the magical value.
What is the reasoning behind this? (E.g. I'd expect 1:1 could work as
well, while 1:4 suggests this is somehow better (empirically?).)
Thanks,
Michal
Hello,
On Thu, Jan 05, 2023 at 05:18:54PM +0100, Michal Koutný wrote:
> I guess similar problem would arise for devices that are "naturally"
> slow.
> Then:
> a) it must have been solved elsewhere in the block layer (but it's
> broken),
> b) it should be solved generically in the block layer (thus this is only
> a partial solution).
Hard limits tend to make this sort of problems a lot more pronounced because
the existing mechanisms tend to break down for the users which are severely
throttled down even while the device as a whole is fairly idle. cpu.max
often triggers severe priority inversions too, so it isn't too surprising
that people hit severe priority inversion issues w/ io.max.
Another problem with blk-throttle is that it doesn't prioritize shared IOs
identified by bio_issue_as_root_blkg() like iolatency and iocost do, so
there can be very severe priority inversions when e.g. journal commit gets
trapped in a low priority cgroup further exacerbating issues like this.
Thanks.
--
tejun
On Thu, Jan 05, 2023 at 07:35:59AM -1000, Tejun Heo <[email protected]> wrote:
> Hard limits tend to make this sort of problems a lot more pronounced because
> the existing mechanisms tend to break down for the users which are severely
> throttled down even while the device as a whole is fairly idle. cpu.max
> often triggers severe priority inversions too, so it isn't too surprising
> that people hit severe priority inversion issues w/ io.max.
To be on the same page:
1) severe PI == priority inversion across cgroups (propagated e.g. via
global locks (as with cpu.max) or FS journal (as with io.max)),
2) ordinary PI == priority inversion contained within a single cgroup,
i.e. no different from an under-provisioned system.
The reported issue sounds like 2) but even with the separated queues 1)
is still possible :-/
> Another problem with blk-throttle is that it doesn't prioritize shared IOs
> identified by bio_issue_as_root_blkg() like iolatency and iocost do, so
> there can be very severe priority inversions when e.g. journal commit gets
> trapped in a low priority cgroup further exacerbating issues like this.
Thanks for the broader view. So the separated queues are certainly an
improvement but ultimately a mechanism based on bio_issue_as_root_blkg()
predicate and deferred throttling would be better? Or is permanent limit
enforcement more important?
Thanks,
Michal
Hello, Michal.
On Thu, Jan 05, 2023 at 08:22:47PM +0100, Michal Koutný wrote:
> On Thu, Jan 05, 2023 at 07:35:59AM -1000, Tejun Heo <[email protected]> wrote:
> > Hard limits tend to make this sort of problems a lot more pronounced because
> > the existing mechanisms tend to break down for the users which are severely
> > throttled down even while the device as a whole is fairly idle. cpu.max
> > often triggers severe priority inversions too, so it isn't too surprising
> > that people hit severe priority inversion issues w/ io.max.
>
> To be on the same page:
> 1) severe PI == priority inversion across cgroups (progated e.g. via
> global locks (as with cpu.max) or FS journal (as with io.max)),
Another important inversion vector is memory reclaim as writeback operations
get trapped in blk-throttle and the system can pile on waiting for clean
pages.
> 2) ordinary PI == priority inversion contained within a single cgroup,
> i.e. no different from an under-provisioned system.
I didn't use the term in a precise manner but yeah both can be problematic.
The former a lot more so.
> The reported issue sounds like 2) but even with the separated queues 1)
> is still possible :-/
Yeah, definitely.
> > Another problem with blk-throttle is that it doesn't prioritize shared IOs
> > identified by bio_issue_as_root_blkg() like iolatency and iocost do, so
> > there can be very severe priority inversions when e.g. journal commit gets
> > trapped in a low priority cgroup further exacerbating issues like this.
>
> Thanks for the broader view. So the separated queues are certainly an
> improvement but ultimately a mechanism based on bio_issue_as_root_blkg()
> predicate and deferred throttling would be better? Or is permanent limit
> enforcement more important?
Generally true but the specific scenario raised by Jinke is rather unusual
and isn't covered by issue_as_root mechanism as it doesn't promote REQ_SYNC
data writes. Then again, this usually isn't a problem as in most cases dirty
throttling through writeback should stave off severe starvations.
blk-throttle is pretty outdated and kinda broken and separating out sync
writes isn't gonna solve most of its problems. However, it will help in some
cases like the one described by Jinke and can sometimes lessen wider scope
inversions too. Given the partial nature, I don't feel too enthusiastic but
at the same time can't find good reasons to reject either. I don't know. I
don't feel too strong either way.
On a tangential note, I've been thinking about implementing io.max on top of
iocost so that each cgroup can just configure the max % of total device IO
capacity that's allowed for it, which should be a lot easier to use and has
many of the problems already solved.
Thanks.
--
tejun
On Thu 05-01-23 17:18:54, Michal Koutný wrote:
> Hello Jinke.
>
> On Mon, Dec 26, 2022 at 09:05:05PM +0800, Jinke Han <[email protected]> wrote:
> > In our test, fio writes a 100g file in sequential 4k blocksize in
> > a container with low bps limit configured (wbps=10M).
> > [...]
> > At the same time, the operation of saving a small file by vim will be
> > blocked amolst 140s.
>
> Could you please elaborate why is this specific to blk-throtl?
>
> I guess similar problem would arise for devices that are "naturally"
> slow.
> Then:
> a) it must have been solved elsewhere in the block layer (but it's
> broken),
> b) it should be solved generically in the block layer (thus this is only
> a partial solution).
Generally, problems like this are taken care of by IO schedulers. E.g. BFQ
has quite a lot of logic exactly to reduce problems like this. Sync and
async queues are one part of this logic inside BFQ (but there's more).
But given current architecture of the block layer IO schedulers are below
throttling frameworks such as blk-throtl so they have no chance of
influencing problems like this. So we are bound to reinvent the scheduling
logic IO schedulers are already doing. That being said I don't have a good
solution for this or architecture suggestion. Because implementing various
throttling frameworks within IO schedulers is cumbersome (complex
interactions) and generally the performance is too slow for some usecases.
We've been there (that's why there's cgroup support in BFQ) and really
the current architecture is much easier to reason about.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hello,
On Fri, Jan 06, 2023 at 04:38:13PM +0100, Jan Kara wrote:
> Generally, problems are this are taken care of by IO schedulers. E.g. BFQ
> has quite a lot of logic exactly to reduce problems like this. Sync and
> async queues are one part of this logic inside BFQ (but there's more).
With modern ssd's, even deadline's overhead is too high and a lot (but
clearly not all) of what the IO schedulers do are no longer necessary. I
don't see a good way back to elevators.
> But given current architecture of the block layer IO schedulers are below
> throttling frameworks such as blk-throtl so they have no chance of
> influencing problems like this. So we are bound to reinvent the scheduling
> logic IO schedulers are already doing. That being said I don't have a good
> solution for this or architecture suggestion. Because implementing various
> throttling frameworks within IO schedulers is cumbersome (complex
> interactions) and generally the perfomance is too slow for some usecases.
> We've been there (that's why there's cgroup support in BFQ) and really
> the current architecture is much easier to reason about.
Another layering problem w/ controlling from elevators is that that's after
request allocation and the issuer has already moved on. We used to have
per-cgroup rq pools but ripped that out, so it's pretty easy to cause severe
priority inversions by depleting the shared request pool, and the fact that
throttling takes place after the issuing task returned from issue path makes
propagating the throttling operation upwards more challenging too.
At least in terms of cgroup control, the new bio based behavior is a lot
better. In the fb fleet, iocost is deployed on most (virtually all) of the
machines and we don't see issues with severe priority inversions.
Cross-cgroup control is pretty well controlled. Inside each cgroup, sync
writes aren't prioritized but nobody seems to be troubled by that.
My bet is that inversion issues are a lot more severe with blk-throttle
because it's not work-conserving and not doing things like issue-as-root or
other measures to alleviate issues which can arise from inversions.
Jinke, is the case you described in the original email what you actually saw
in production or a simplified test case for demonstration? If the latter,
can you describe actual problems seen in production?
Thanks.
--
tejun
On 2023/1/7 12:58 AM, Tejun Heo wrote:
>
> Jinke, is the case you described in the original email what you actually saw
> in production or a simplified test case for demonstration? If the latter,
> can you describe actual problems seen in production?
>
> Thanks.
>
Hi
In our internal deployment, iocost is the main io isolation method and
its use is gradually spreading. But for some specific scenarios on old
kernel versions, blk-throtl is also needed. The scenario described in my
email came up during early research and extensive testing of blk-throtl.
During this period, some priority inversion issues among cgroups or
within one cgroup were observed, so I sent this patch to try to fix or
mitigate some of them.
Thanks
Jinke
Hello!
On Fri 06-01-23 06:58:05, Tejun Heo wrote:
> Hello,
>
> On Fri, Jan 06, 2023 at 04:38:13PM +0100, Jan Kara wrote:
> > Generally, problems are this are taken care of by IO schedulers. E.g. BFQ
> > has quite a lot of logic exactly to reduce problems like this. Sync and
> > async queues are one part of this logic inside BFQ (but there's more).
>
> With modern ssd's, even deadline's overhead is too high and a lot (but
> clearly not all) of what the IO schedulers do are no longer necessary. I
> don't see a good way back to elevators.
Yeah, I agree there's no way back :). But actually I think a lot of the
functionality of IO schedulers is not needed (by you ;)) only because the
HW got performant enough and so some issues became less visible. And that
is all fine but if you end up in a configuration where your cgroup's IO
limits and IO demands are similar to how the old rotational disks were
underprovisioned for the amount of IO needed to be done by the system
(i.e., you can easily generate amount of IO that then takes minutes or tens
of minutes for your IO subsystem to crunch through), you hit all the same
problems IO schedulers were trying to solve again. And maybe these days we
incline more towards the answer "buy more appropriate HW / buy higher
limits from your infrastructure provider" but it is not like the original
issues in such configurations disappeared.
> > But given current architecture of the block layer IO schedulers are below
> > throttling frameworks such as blk-throtl so they have no chance of
> > influencing problems like this. So we are bound to reinvent the scheduling
> > logic IO schedulers are already doing. That being said I don't have a good
> > solution for this or architecture suggestion. Because implementing various
> > throttling frameworks within IO schedulers is cumbersome (complex
> > interactions) and generally the perfomance is too slow for some usecases.
> > We've been there (that's why there's cgroup support in BFQ) and really
> > the current architecture is much easier to reason about.
>
> Another layering problem w/ controlling from elevators is that that's after
> request allocation and the issuer has already moved on. We used to have
> per-cgroup rq pools but ripped that out, so it's pretty easy to cause severe
> priority inversions by depleting the shared request pool, and the fact that
> throttling takes place after the issuing task returned from issue path makes
> propagating the throttling operation upwards more challenging too.
Well, we do have .limit_depth IO scheduler callback these days so BFQ uses
that to solve the problem of exhaustion of shared request pool but I agree
it's a bit of a hack on the side.
> At least in terms of cgroup control, the new bio based behavior is a lot
> better. In the fb fleet, iocost is deployed on most (virtually all) of the
> machines and we don't see issues with severe priority inversions.
> Cross-cgroup control is pretty well controlled. Inside each cgroup, sync
> writes aren't prioritized but nobody seems to be troubled by that.
>
> My bet is that inversion issues are a lot more severe with blk-throttle
> because it's not work-conserving and not doing things like issue-as-root or
> other measures to alleviate issues which can arise from inversions.
Yes, I agree these features of blk-throttle make the problems much more
likely to happen in practice.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hello, Jan.
On Mon, Jan 09, 2023 at 11:59:16AM +0100, Jan Kara wrote:
> Yeah, I agree there's no way back :). But actually I think a lot of the
> functionality of IO schedulers is not needed (by you ;)) only because the
> HW got performant enough and so some issues became less visible. And that
> is all fine but if you end up in a configuration where your cgroup's IO
> limits and IO demands are similar to how the old rotational disks were
> underprovisioned for the amount of IO needed to be done by the system
> (i.e., you can easily generate amount of IO that then takes minutes or tens
> of minutes for your IO subsystem to crunch through), you hit all the same
> problems IO schedulers were trying to solve again. And maybe these days we
> incline more towards the answer "buy more appropriate HW / buy higher
> limits from your infrastructure provider" but it is not like the original
> issues in such configurations disappeared.
Yeah, but I think there's a better way out as there's still a difference
between the two situations. W/ hard disks, you're actually out of bandwidth.
With SSDs, we know there is capacity we can borrow to get out of
the tough spot. e.g. w/ iocost, you can constrain a cgroup to a point where
its throughput gets to a level similar to that of hard disks; however, that
still doesn't (or at least shouldn't) cause noticeable priority inversions
outside of that cgroup, because issue_as_root promotes the IOs which can be
waited upon by other cgroups to root, charging the cost to the cgroup as
debt and further slowing it down afterwards.
There's a lot to be improved - e.g. the debt accounting and payback, and
the propagation to originator throttling, aren't very accurate, usually
leading to over-throttling and under-utilization in some cases. The coupling between IO
control and dirty throttling is there and kinda works but it seems like it's
pretty easy to make it misbehave under heavy control and so on. But, even
with all those shortcomings, at least iocost is feature complete and already
works (not perfectly but still) in most cases - it can actually distribute
IO bandwidth across the cgroups with arbitrary weights without causing
noticeable priority inversions across cgroups.
blk-throttle unfortunately doesn't have issue_as_root and the issuer delay
mechanism hooked up and we found that it's near impossible to configure
properly in any scalable manner. Raw bw and iops limits just can't capture
application behavior variances well enough. Often, the valid parameter space
becomes null when trying to cover varied behaviors. Given the problem is
pretty fundamental for the control scheme, I largely gave up on it with the
long term goal of implementing io.max on top of iocost down the line.
> > Another layering problem w/ controlling from elevators is that that's after
> > request allocation and the issuer has already moved on. We used to have
> > per-cgroup rq pools but ripped that out, so it's pretty easy to cause severe
> > priority inversions by depleting the shared request pool, and the fact that
> > throttling takes place after the issuing task returned from issue path makes
> > propagating the throttling operation upwards more challenging too.
>
> Well, we do have .limit_depth IO scheduler callback these days so BFQ uses
> that to solve the problem of exhaustion of shared request pool but I agree
> it's a bit of a hack on the side.
Ah didn't know about that. Yeah, that'd help the situation to some degree.
> > My bet is that inversion issues are a lot more severe with blk-throttle
> > because it's not work-conserving and not doing things like issue-as-root or
> > other measures to alleviate issues which can arise from inversions.
>
> Yes, I agree these features of blk-throttle make the problems much more
> likely to happen in practice.
As I wrote above, I largely gave up on blk-throttle, and things like tweaking
sync write priority don't address most of its problems (e.g. it's still
gonna be super easy to stall the whole system with a heavily throttled
cgroup). However, it can still be useful for some use cases and if it can be
tweaked to become a bit better, I don't see a reason not to do that.
Thanks.
--
tejun
Hello.
Thanks all for sharing ideas and more details (in time-previous messages).
On Sat, Jan 07, 2023 at 02:07:38AM +0800, hanjinke <[email protected]> wrote:
> But for some specific scenarios with old kernel versions, blk-throtl
> is also needed. The scenario described in my email is in the early stage of
> research and extensive testing for it. During this period, some priority
> inversion issues among cgroups or within one cgroup have been observed. So I
> send this patch to try to fix or mitigate some of these issues.
Jinke, do you combine blk-throtl with memory limits? (As that could in theory
indirectly reduce async requests as dirtier would be slowed down.)
Thanks,
Michal
On 2023/1/11 8:35 PM, Michal Koutný wrote:
> Hello.
>
> Thanks all for sharing ideas and more details (in time-previous messages).
>
> On Sat, Jan 07, 2023 at 02:07:38AM +0800, hanjinke <[email protected]> wrote:
>> But for some specific scenarios with old kernel versions, blk-throtl
>> is also needed. The scenario described in my email is in the early stage of
>> research and extensive testing for it. During this period, some priority
>> inversion issues among cgroups or within one cgroup have been observed. So I
>> send this patch to try to fix or mitigate some of these issues.
>
> Jinke, do you combine blk-throtl with memory limits? (As that could in theory
> indirectly reduce async requests as dirtier would be slowed down.)
>
Hi,
In fact, some of the tests above are done in a VM with only 16 GB of
memory in total.
I agree with what you said: if we further tighten the memory usage, things
will get better because this will indirectly reduce the writeback pages in
the blk-throtl queue.
Thanks.