Hi, this series contains a few patches to fix problems on the default
hierarchy, correct stale comments and so on. More detail can be found
in the respective changelogs. Thanks.
---
V1->V2:
-Thanks for the review and advice from Tejun. The corrected comment in
"blk-throttle: correct stale comment in throtl_pd_init" and the
solution in "blk-throttle: Fix that bps of child could exceed bps
limited in parent" come from Tejun's replies.
-Collect Acks from Tejun.
-Fix the compile problem when CONFIG_BLK_DEV_THROTTLING_LOW is set.
-Drop "blk-throttle: Limit whole system if root group is configured
when on the default hierarchy", "blk-throttle: remove unnecessary check
for validation of limit index" and "blk-throttle: remove unused variable
td in tg_update_has_rules".
-Add "blk-throttle: correct stale comment in throtl_pd_init" and
"blk-throttle: avoid dead code in throtl_hierarchy_can_upgrade".
-Set the BIO_BPS_THROTTLED flag only when the bio has traversed the
entire tree, so that the bps of a child can no longer exceed the bps
limit of its parent (patch 2/10).
-Improve the description and comments of most commits.
---
Kemeng Shi (10):
blk-throttle: correct stale comment in throtl_pd_init
blk-throttle: Fix that bps of child could exceed bps limited in parent
blk-throttle: ignore cgroup without io queued in
blk_throtl_cancel_bios
blk-throttle: correct calculation of wait time in tg_may_dispatch
blk-throttle: simplify low limit reached check in throtl_tg_can_upgrade
blk-throttle: fix typo in comment of throtl_adjusted_limit
blk-throttle: remove incorrect comment for tg_last_low_overflow_time
blk-throttle: remove repeat check of elapsed time from last upgrade in
throtl_hierarchy_can_downgrade
blk-throttle: Use more suitable time_after check for update of
slice_start
blk-throttle: avoid dead code in throtl_hierarchy_can_upgrade
block/blk-throttle.c | 104 +++++++++++++++++++++++--------------------
1 file changed, 56 insertions(+), 48 deletions(-)
--
2.30.0
Commit c79892c557616 ("blk-throttle: add upgrade logic for LIMIT_LOW
state") added upgrade logic for low limit and methioned that
1. "To determine if a cgroup exceeds its limitation, we check if the cgroup
has pending request. Since cgroup is throttled according to the limit,
pending request means the cgroup reaches the limit."
2. "If a cgroup has limit set for both read and write, we consider the
combination of them for upgrade. The reason is read IO and write IO can
interfere with each other. If we do the upgrade based in one direction IO,
the other direction IO could be severly harmed."
Besides, we also determine that a cgroup reaches its low limit if the
low limit is 0; see the comment in throtl_tg_can_upgrade.
Collecting the information above, the design of the upgrade check is as
follows:
1. The low limit is reached if the limit is zero or IO is already queued.
2. A cgroup passes the upgrade check if the low limits of READ and WRITE
are both reached.
Simplify the check code described above to remove the repeated checks
and improve readability. There is no functional change.
A detailed equivalence proof follows.
All the replaced conditions that return true are:
condition 1
(!read_limit && !write_limit)
condition 2
read_limit && sq->nr_queued[READ] && (!write_limit || sq->nr_queued[WRITE])
condition 3
write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ])
Transforming condition 2:
read_limit && sq->nr_queued[READ] && (!write_limit || sq->nr_queued[WRITE])
is equivalent to
(read_limit && sq->nr_queued[READ]) &&
(!write_limit || (write_limit && sq->nr_queued[WRITE]))
is equivalent to
condition 2.1
(read_limit && sq->nr_queued[READ] && !write_limit) ||
condition 2.2
(read_limit && sq->nr_queued[READ] &&
(write_limit && sq->nr_queued[WRITE]))
Transforming condition 3:
write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ])
is equivalent to
(write_limit && sq->nr_queued[WRITE]) &&
(!read_limit || (read_limit && sq->nr_queued[READ]))
is equivalent to
condition 3.1
((write_limit && sq->nr_queued[WRITE]) && !read_limit) ||
condition 3.2
((write_limit && sq->nr_queued[WRITE]) &&
(read_limit && sq->nr_queued[READ]))
Condition 3.2 is the same as condition 2.2, so all the conditions under
which we return true are:
(!read_limit && !write_limit) (1)
(!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1)
((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1)
((write_limit && sq->nr_queued[WRITE]) &&
(read_limit && sq->nr_queued[READ])) (2.2)
Since we can expand "(a1 || a2) && (b1 || b2)" to:
a1 && b1
a1 && b2
a2 && b1
a2 && b2
Considering that:
a1 = !read_limit
a2 = read_limit && sq->nr_queued[READ]
b1 = !write_limit
b2 = write_limit && sq->nr_queued[WRITE]
We can pack the replaced conditions into
(!read_limit || (read_limit && sq->nr_queued[READ])) &&
(!write_limit || (write_limit && sq->nr_queued[WRITE]))
which is equivalent to
(!read_limit || sq->nr_queued[READ]) &&
(!write_limit || sq->nr_queued[WRITE])
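As a sanity check (not part of the patch), the equivalence can also be
brute-forced over all 16 input combinations with a small standalone
userspace C program; the variable names here are illustrative, not the
kernel's:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	/* r/w: read/write low limit set; qr/qw: sq->nr_queued[READ/WRITE] != 0 */
	for (int mask = 0; mask < 16; mask++) {
		bool r = mask & 1, w = mask & 2, qr = mask & 4, qw = mask & 8;
		bool old_cond = (!r && !w) ||
				(r && qr && (!w || qw)) ||
				(w && qw && (!r || qr));
		bool new_cond = (!r || qr) && (!w || qw);

		assert(old_cond == new_cond);
	}
	printf("old and new upgrade checks agree on all 16 cases\n");
	return 0;
}

The assert never fires, confirming the proof above.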
Signed-off-by: Kemeng Shi <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
block/blk-throttle.c | 31 ++++++++++++++++++-------------
1 file changed, 18 insertions(+), 13 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 06e4193b064e..4977fac56a00 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1816,24 +1816,29 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
return ret;
}
-static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+static bool throtl_tg_reach_low_limit(struct throtl_grp *tg, int rw)
{
struct throtl_service_queue *sq = &tg->service_queue;
- bool read_limit, write_limit;
+ bool limit = tg->bps[rw][LIMIT_LOW] || tg->iops[rw][LIMIT_LOW];
/*
- * if cgroup reaches low limit (if low limit is 0, the cgroup always
- * reaches), it's ok to upgrade to next limit
+ * if low limit is zero, low limit is always reached.
+	 * if low limit is non-zero, we can check whether any request is
+	 * queued to determine whether the low limit is reached, as we
+	 * throttle requests according to the limit.
*/
- read_limit = tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW];
- write_limit = tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW];
- if (!read_limit && !write_limit)
- return true;
- if (read_limit && sq->nr_queued[READ] &&
- (!write_limit || sq->nr_queued[WRITE]))
- return true;
- if (write_limit && sq->nr_queued[WRITE] &&
- (!read_limit || sq->nr_queued[READ]))
+ return !limit || sq->nr_queued[rw];
+}
+
+static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+{
+ /*
+	 * A cgroup reaches its low limit when the low limits of READ
+	 * and WRITE are both reached; it is OK to upgrade to the next
+	 * limit then.
+ */
+ if (throtl_tg_reach_low_limit(tg, READ) &&
+ throtl_tg_reach_low_limit(tg, WRITE))
return true;
if (time_after_eq(jiffies,
--
2.30.0
On the default hierarchy (cgroup2), the throttle interface files don't
exist in the root cgroup, so the ability to limit the whole system
by configuring the root group no longer exists. In general, cgroup
doesn't want to be in the business of restricting resources at the
system level, so correct the stale comment: rather than limiting the
whole system, we can only limit a subtree.
Signed-off-by: Kemeng Shi <[email protected]>
---
block/blk-throttle.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 847721dc2b2b..59acfac87764 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -395,8 +395,9 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
* If on the default hierarchy, we switch to properly hierarchical
* behavior where limits on a given throtl_grp are applied to the
* whole subtree rather than just the group itself. e.g. If 16M
- * read_bps limit is set on the root group, the whole system can't
- * exceed 16M for the device.
+ * read_bps limit is set on a "parent" group, summary bps of
+ * "parent" group and its subtree groups can't exceed 16M for the
+ * device.
*
* If not on the default hierarchy, the broken flat hierarchy
* behavior is retained where all throtl_grps are treated as if
--
2.30.0
There is no need to update tg->slice_start[rw] to start when they are
already equal. So remove the "eq" part of the check before updating
slice_start.
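As background, the two comparison macros differ only in the equal case,
where the guarded store would write back the same value anyway. A
minimal userspace sketch, with the macro bodies copied from the jiffies
helpers (typecheck() machinery omitted):

#include <stdio.h>

/* wrap-safe time comparisons as in include/linux/jiffies.h,
 * with the typecheck() machinery omitted */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)

int main(void)
{
	unsigned long start = 1000, slice_start = 1000;

	/* equal timestamps: only the _eq variant is true, and the
	 * store it guards would write back the same value */
	printf("equal: after=%d after_eq=%d\n",
	       time_after(start, slice_start),
	       time_after_eq(start, slice_start));	/* 0 1 */

	start++;
	printf("later: after=%d after_eq=%d\n",
	       time_after(start, slice_start),
	       time_after_eq(start, slice_start));	/* 1 1 */
	return 0;
}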
Signed-off-by: Kemeng Shi <[email protected]>
---
block/blk-throttle.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 94d850b57462..4c80f2aa1e29 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -645,7 +645,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
* that bandwidth. Do try to make use of that bandwidth while giving
* credit.
*/
- if (time_after_eq(start, tg->slice_start[rw]))
+ if (time_after(start, tg->slice_start[rw]))
tg->slice_start[rw] = start;
tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
--
2.30.0
Function throtl_hierarchy_can_upgrade always returns from inside the
while loop, so the return outside the loop is never reached. Break out
of the loop when we reach the root, as throtl_hierarchy_can_downgrade
does, to avoid the dead code.
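For illustration, a toy userspace model of the resulting control flow;
the names and structures are simplified stand-ins, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

/* toy model of the hierarchy walk, not the kernel's structures */
struct node {
	struct node *parent;
	bool can_upgrade;
};

static bool hierarchy_can_upgrade(struct node *n)
{
	while (true) {
		if (n->can_upgrade)
			return true;
		n = n->parent;
		if (!n || !n->parent)
			break;	/* was "return false"; breaking makes
				 * the return below the single miss
				 * path instead of dead code */
	}
	return false;
}

int main(void)
{
	struct node root = { NULL, false };
	struct node mid = { &root, false };
	struct node leaf = { &mid, false };

	printf("%d\n", hierarchy_can_upgrade(&leaf));	/* 0 */
	mid.can_upgrade = true;
	printf("%d\n", hierarchy_can_upgrade(&leaf));	/* 1 */
	return 0;
}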
Signed-off-by: Kemeng Shi <[email protected]>
---
block/blk-throttle.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4c80f2aa1e29..9946524de158 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1854,7 +1854,7 @@ static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
return true;
tg = sq_to_tg(tg->service_queue.parent_sq);
if (!tg || !tg_to_blkg(tg)->parent)
- return false;
+ break;
}
return false;
}
--
2.30.0
Ignore cgroups without IO queued in blk_throtl_cancel_bios, for two
reasons:
1. Save CPU cycles by not trying to dispatch a cgroup that has no IO
queued.
2. Avoid the inconsistent state in which a cgroup is inserted into the
service queue without THROTL_TG_PENDING set, as tg_update_disptime will
unconditionally re-insert the cgroup into the service queue. If we are
on the default hierarchy, IO dispatched from a child in
tg_dispatch_one_bio would then insert the cgroup into the service queue
without erasing it first and corrupt the tree (see the toy sketch
below).
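The service queue is an rbtree in the kernel, but the corruption mode
can be shown with any linked structure. A toy userspace sketch (names
are illustrative, not kernel code) of what a blind re-insert without a
prior erase does:

#include <stdio.h>

/* toy pending queue; a blind re-insert corrupts any linked
 * structure the same way it would corrupt the rbtree */
struct tnode {
	struct tnode *next;
	const char *name;
};

static struct tnode *head;

static void insert(struct tnode *n)
{
	n->next = head;		/* overwrites n->next even if n is
				 * already on the list */
	head = n;
}

int main(void)
{
	struct tnode a = { .name = "a" };
	struct tnode b = { .name = "b" };

	insert(&a);
	insert(&b);	/* list: b -> a */
	insert(&b);	/* re-insert without erase: b->next becomes b,
			 * "a" is unreachable and the list cycles */

	int steps = 0;
	for (struct tnode *n = head; n && steps < 5; n = n->next, steps++)
		printf("%s ", n->name);
	printf("\n");	/* prints "b b b b b", not "b a" */
	return 0;
}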
Signed-off-by: Kemeng Shi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
block/blk-throttle.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d516a37e82fb..ee7dc1a0adfd 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1738,7 +1738,18 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
* Set the flag to make sure throtl_pending_timer_fn() won't
* stop until all throttled bios are dispatched.
*/
- blkg_to_tg(blkg)->flags |= THROTL_TG_CANCELING;
+ tg->flags |= THROTL_TG_CANCELING;
+
+ /*
+	 * Do not dispatch a cgroup without THROTL_TG_PENDING, or the
+	 * cgroup will be inserted into the service queue without
+	 * THROTL_TG_PENDING set in tg_update_disptime below. Then IO
+	 * dispatched from a child in tg_dispatch_one_bio would trigger
+	 * a double insertion and corrupt the tree.
+ */
+ if (!(tg->flags & THROTL_TG_PENDING))
+ continue;
+
/*
* Update disptime after setting the above flag to make sure
* throtl_select_dispatch() won't exit without dispatching.
--
2.30.0
Function tg_last_low_overflow_time is called with intermediate nodes as
follows:
throtl_hierarchy_can_downgrade
throtl_tg_can_downgrade
tg_last_low_overflow_time
throtl_hierarchy_can_upgrade
throtl_tg_can_upgrade
tg_last_low_overflow_time
throtl_hierarchy_can_downgrade/throtl_hierarchy_can_upgrade traverse
from a leaf node to the sub-root node and pass the traversed
intermediate nodes to tg_last_low_overflow_time.
No such restriction can be found in the context or the implementation
of tg_last_low_overflow_time, so remove it from the comment.
Signed-off-by: Kemeng Shi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
block/blk-throttle.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4b11099625da..3ddd8a15aa3f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1762,7 +1762,6 @@ static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
return min(rtime, wtime);
}
-/* tg should not be an intermediate node */
static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
{
struct throtl_service_queue *parent_sq;
--
2.30.0
There is no need to check the elapsed time since the last upgrade for
each node in the hierarchy. Move this check before the traversal, as
throtl_can_upgrade does, to remove the repeated check.
Signed-off-by: Kemeng Shi <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
block/blk-throttle.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3ddd8a15aa3f..94d850b57462 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1955,8 +1955,7 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
* If cgroup is below low limit, consider downgrade and throttle other
* cgroups
*/
- if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
- time_after_eq(now, tg_last_low_overflow_time(tg) +
+ if (time_after_eq(now, tg_last_low_overflow_time(tg) +
td->throtl_slice) &&
(!throtl_tg_is_idle(tg) ||
!list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
@@ -1966,6 +1965,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg)
{
+ struct throtl_data *td = tg->td;
+
+ if (time_before(jiffies, td->low_upgrade_time + td->throtl_slice))
+ return false;
+
while (true) {
if (!throtl_tg_can_downgrade(tg))
return false;
--
2.30.0
lapsed time -> elapsed time
Signed-off-by: Kemeng Shi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
block/blk-throttle.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4977fac56a00..4b11099625da 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -129,7 +129,7 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
/*
* cgroup's limit in LIMIT_MAX is scaled if low limit is set. This scale is to
* make the IO dispatch more smooth.
- * Scale up: linearly scale up according to lapsed time since upgrade. For
+ * Scale up: linearly scale up according to elapsed time since upgrade. For
* every throtl_slice, the limit scales up 1/2 .low limit till the
* limit hits .max limit
* Scale down: exponentially scale down if a cgroup doesn't hit its .low limit
--
2.30.0
On Tue, Nov 29, 2022 at 11:01:38AM +0800, Kemeng Shi wrote:
> On the default hierarchy (cgroup2), the throttle interface files don't
> exist in the root cgroup, so the ablity to limit the whole system
> by configuring root group is not existing anymore. In general, cgroup
> doesn't wanna be in the business of restricting resources at the
> system level, so correct the stale comment that we can limit whole
> system to we can only limit subtree.
>
> Signed-off-by: Kemeng Shi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
> + * read_bps limit is set on a "parent" group, summary bps of
> + * "parent" group and its subtree groups can't exceed 16M for the
but why the quotes around parent?
--
tejun
On Tue, Nov 29, 2022 at 11:01:46AM +0800, Kemeng Shi wrote:
> There is no need to update tg->slice_start[rw] to start when they are
> equal already. So remove "eq" part of check before update slice_start.
>
> Signed-off-by: Kemeng Shi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
--
tejun
On Tue, Nov 29, 2022 at 11:01:47AM +0800, Kemeng Shi wrote:
> Function throtl_hierarchy_can_upgrade will always return from while loop,
> so the return outside while loop is never reached. Break the loop when
> we traverse to root as throtl_hierarchy_can_downgrade do to avoid dead
> code.
I don't know why this is an improvement.
--
tejun
On Tue, Nov 29, 2022 at 11:01:45AM +0800, Kemeng Shi wrote:
> There is no need to check elapsed time from last upgrade for each node in
> hierarchy. Move this check before traversing as throtl_can_upgrade do
> to remove repeat check.
>
> Signed-off-by: Kemeng Shi <[email protected]>
> Reported-by: kernel test robot <[email protected]>
Acked-by: Tejun Heo <[email protected]>
--
tejun
On Tue, Nov 29, 2022 at 11:01:42AM +0800, Kemeng Shi wrote:
> Commit c79892c557616 ("blk-throttle: add upgrade logic for LIMIT_LOW
> state") added upgrade logic for low limit and methioned that
> 1. "To determine if a cgroup exceeds its limitation, we check if the cgroup
> has pending request. Since cgroup is throttled according to the limit,
> pending request means the cgroup reaches the limit."
> 2. "If a cgroup has limit set for both read and write, we consider the
> combination of them for upgrade. The reason is read IO and write IO can
> interfere with each other. If we do the upgrade based in one direction IO,
> the other direction IO could be severly harmed."
> Besides, we also determine that cgroup reaches low limit if low limit is 0,
> see comment in throtl_tg_can_upgrade.
>
> Collect the information above, the desgin of upgrade check is as following:
> 1.The low limit is reached if limit is zero or io is already queued.
> 2.Cgroup will pass upgrade check if low limits of READ and WRITE are both
> reached.
>
> Simpfy the check code described above to removce repeat check and improve
> readability. There is no functional change.
>
> Detail equivalence proof is as following:
> All replaced conditions to return true are as following:
> condition 1
> (!read_limit && !write_limit)
> condition 2
> read_limit && sq->nr_queued[READ] && (!write_limit || sq->nr_queued[WRITE])
> condition 3
> write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ])
>
> Transferring condition 2 as following:
> read_limit && sq->nr_queued[READ] && (!write_limit || sq->nr_queued[WRITE])
> is equivalent to
> (read_limit && sq->nr_queued[READ]) &&
> (!write_limit || (write_limit && sq->nr_queued[WRITE]))
> is equivalent to
> condition 2.1
> (read_limit && sq->nr_queued[READ] && !write_limit) ||
> condition 2.2
> (read_limit && sq->nr_queued[READ] &&
> (write_limit && sq->nr_queued[WRITE]))
>
> Transferring condition 3 as following:
> write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ])
> is equivalent to
> (write_limit && sq->nr_queued[WRITE]) &&
> (!read_limit || (read_limit && sq->nr_queued[READ]))
> is equivalent to
> condition 3.1
> ((write_limit && sq->nr_queued[WRITE]) && !read_limit) ||
> condition 3.2
> ((write_limit && sq->nr_queued[WRITE]) &&
> (read_limit && sq->nr_queued[READ]))
>
> Condition 3.2 is the same as condition 2.2, so all conditions we get to
> return are as following:
> (!read_limit && !write_limit) (1)
> (!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1)
> ((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1)
> ((write_limit && sq->nr_queued[WRITE]) &&
> (read_limit && sq->nr_queued[READ])) (2.2)
>
> As we can extract conditions "(a1 || a2) && (b1 || b2)" to:
> a1 && b1
> a1 && b2
> a2 && b1
> ab && b2
>
> Considering that:
> a1 = !read_limit
> a2 = read_limit && sq->nr_queued[READ]
> b1 = !write_limit
> b2 = write_limit && sq->nr_queued[WRITE]
>
> We can pack replaced conditions to
> (!read_limint || (read_limit && sq->nr_queued[READ])) &&
> (!write_limit || (write_limit && sq->nr_queued[WRITE])
> which is equivalent to
> (!read_limint || sq->nr_queued[READ]) &&
^
typo
> (!write_limit || sq->nr_queued[WRITE])
Can you indent the whole thing a bit so that it's more readable?
> -static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
> +static bool throtl_tg_reach_low_limit(struct throtl_grp *tg, int rw)
> {
> struct throtl_service_queue *sq = &tg->service_queue;
> - bool read_limit, write_limit;
> + bool limit = tg->bps[rw][LIMIT_LOW] || tg->iops[rw][LIMIT_LOW];
>
> /*
> - * if cgroup reaches low limit (if low limit is 0, the cgroup always
> - * reaches), it's ok to upgrade to next limit
> + * if low limit is zero, low limit is always reached.
> + * if low limit is non-zero, we can check if there is any request
> + * is queued to determine if low limit is reached as we throttle
> + * request according to limit.
> */
> - read_limit = tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW];
> - write_limit = tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW];
> - if (!read_limit && !write_limit)
> - return true;
> - if (read_limit && sq->nr_queued[READ] &&
> - (!write_limit || sq->nr_queued[WRITE]))
> - return true;
> - if (write_limit && sq->nr_queued[WRITE] &&
> - (!read_limit || sq->nr_queued[READ]))
> + return !limit || sq->nr_queued[rw];
> +}
> +
> +static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
> +{
> + /*
> + * cgroup reaches low limit when low limit of READ and WRITE are
> + * both reached, it's ok to upgrade to next limit if cgroup reaches
> + * low limit
> + */
> + if (throtl_tg_reach_low_limit(tg, READ) &&
> + throtl_tg_reach_low_limit(tg, WRITE))
Can you please name it throtl_low_limit_reached()? Other than that,
Acked-by: Tejun Heo <[email protected]>
Thanks.
--
tejun
Hi, Tejun
on 12/1/2022 5:36 AM, Tejun Heo wrote:
> On Tue, Nov 29, 2022 at 11:01:47AM +0800, Kemeng Shi wrote:
>> Function throtl_hierarchy_can_upgrade will always return from while loop,
>> so the return outside while loop is never reached. Break the loop when
>> we traverse to root as throtl_hierarchy_can_downgrade do to avoid dead
>> code.
>
> I don't know why this is an improvement.
>
I just found that the "return false" outside of the while loop is never
executed, which may show up as unreached code in coverage tests. Of
course, we could simply remove this "return false", but I found a
similar function, throtl_hierarchy_can_downgrade, which has no such
problem. So I avoid the unreachable code the same way
throtl_hierarchy_can_downgrade does.
Does this make sense to you? If not, I will remove this patch in the
next version.
Thanks.
--
Best wishes
Kemeng Shi