2021-03-25 06:59:36

by brookxu.cn

Subject: [PATCH v3 00/14] bfq: introduce bfq.ioprio for cgroup

From: Chunguang Xu <[email protected]>

Any suggestions or discussion are welcome. Thank you very much.

BACKGROUND:
In the container scenario, in addition to throughput, we also pay
attention to the QoS of each group. Based on hierarchical scheduling,
EMQ, IO injection, bfq.weight and other mechanisms, we can achieve
better IO isolation, better throughput, and better avoidance of
priority inversion. However, there is still room for optimization.

OPTIMIZATION:

We made the following changes to make bfq work better in the container scene.

1. Introduce bfq.ioprio
Tasks in the production environment can be roughly divided into
three categories: emergency, ordinary and offline. Emergency tasks,
such as system agents, need to be scheduled in real time. Offline
tasks, such as background jobs, do not need any QoS guarantee, but
can improve system resource utilization during idle periods.

At present, we can use weights to simulate IO preemption, but since
a weight is more of a share concept, the simulation is imperfect.
Using an ioprio class per group solves this problem more naturally:
with hierarchical scheduling, we can ensure that RT and IDLE groups
are scheduled correctly. In addition, we introduce an ioprio value
per group, so that a weight is realized through the combination of
ioprio class and ioprio. In scenarios where only simple weights are
needed, bfq.ioprio alone provides both IO preemption and weight
isolation.
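
To make the weight mapping concrete, here is a minimal C sketch of
how a group's "class data" pair is turned into an entity weight. It
mirrors the bfq_io_set_ioprio() path in patch 03 and assumes the
in-tree bfq_ioprio_to_weight() conversion; the helper name is
illustrative, not the exact kernel code:

	/*
	 * Sketch only: with IOPRIO_BE_NR == 8 and
	 * BFQ_WEIGHT_CONVERSION_COEFF == 10, ioprio 0 maps to
	 * weight 80 and ioprio 7 to weight 10. Class 0 means the
	 * feature is disabled and the legacy default weight is kept.
	 */
	static unsigned short group_ioprio_to_weight(unsigned short ioprio_class,
						     unsigned short ioprio)
	{
		if (!ioprio_class)
			return BFQ_WEIGHT_LEGACY_DFL;
		return (IOPRIO_BE_NR - ioprio) * BFQ_WEIGHT_CONVERSION_COEFF;
	}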

After the introduction of bfq.ioprio, in order to better adapt to
the actual container environment, we use the task's ioprio class
when scheduling within a group, but the group's ioprio class outside
the group. For example, when counting bfqd->busy_queues[], tasks from
a CLASS_IDLE group are always regarded as CLASS_IDLE, and the ioprio
class of the task itself is ignored.
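
In code form (condensed from the bfq_ioprio_class() helper added in
patch 04), the effective class of a queue is resolved as:

	/* the group's class wins when set; otherwise the task's class */
	class = bfqg->ioprio_class ?: bfqq->ioprio_class;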

2. Introduce better_fairness mode
Better QoS control requires sacrificing throughput, which is not
suitable for all scenarios. For this reason, we add a switch called
better_fairness. After better_fairness is enabled, the following
restrictions apply (a condensed code sketch follows this list):

Guarantee group QoS:
1. A cooperator queue can only come from the same group and the same class.
2. A waker queue can only come from the same group and the same class.
3. An injected queue can only come from the same group and the same class.

Guarantee QoS of RT tasks:
1. The async queue cannot be injected ahead of an RT queue.
2. Traverse the upper scheduling domains to determine whether
in_service_queue needs to be preempted.
3. If in_service_queue is marked prio_expire, disable idling.

Better buffered IO control:
1. Except for CLASS_IDLE queues, all queues allow idling by default.
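
The "same group and the same class" rule above maps to a single
check; a condensed sketch of the helper added in patch 10
(cooperator, waker and injection selection all go through it):

	/*
	 * Under better_fairness, do not inject/merge across group
	 * or class boundaries.
	 */
	if (unlikely(bfqd->better_fairness) &&
	    (bfq_class(bfqq) != bfq_class(new_bfqq) ||
	     bfqq_group(bfqq) != bfqq_group(new_bfqq)))
		return false;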

INTERFACE:

The bfq.ioprio interface is now available for cgroup v1 and cgroup
v2. Users can configure the ioprio of a cgroup through this
interface, as shown below:

echo "1 2" > blkio.bfq.ioprio

The two values represent the ioprio class and the ioprio of the
cgroup, respectively.
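
The setting can be read back through the same file; given the write
above, the show handler added in patch 03 is expected to print:

# cat blkio.bfq.ioprio
1 2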

EXPERIMENT:

The test process is as follows:
# prepare data disk
mount /dev/sdb /data1

# prepare IO scheduler
echo bfq > /sys/block/sdb/queue/scheduler
echo 0 > /sys/block/sdb/queue/iosched/low_latency
echo 1 > /sys/block/sdb/queue/iosched/better_fairness

It is worth noting here that nr_requests limits the number of
requests and is not priority-aware. If nr_requests is too small,
it may cause a serious priority inversion problem. Therefore, we
can increase nr_requests based on the actual situation.
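
For example (1024 is only an illustrative value; choose one based
on the workload):

echo 1024 > /sys/block/sdb/queue/nr_requests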

# create cgroup v1 hierarchy
cd /sys/fs/cgroup/blkio
mkdir rt be0 be1 be2 idle

# prepare cgroup
echo "1 0" > rt/blkio.bfq.ioprio
echo "2 0" > be0/blkio.bfq.ioprio
echo "2 4" > be1/blkio.bfq.ioprio
echo "2 7" > be2/blkio.bfq.ioprio
echo "3 0" > idle/blkio.bfq.ioprio

# run fio test
fio fio.ini

# generate svg graph
fio_generate_plots res

The contents of fio.ini are as follows:
[global]
ioengine=libaio
group_reporting=1
log_avg_msec=3000
direct=1
time_based=1
iodepth=16
size=100M
rw=write
bs=1M
[rt]
name=rt
write_bw_log=rt
write_lat_log=rt
write_iops_log=rt
filename=/data1/rt.bin
cgroup=rt
runtime=30s
nice=-10
[be0]
name=be0
write_bw_log=be0
write_lat_log=be0
write_iops_log=be0
filename=/data1/be0.bin
cgroup=be0
runtime=60s
[be1]
name=be1
write_bw_log=be1
write_lat_log=be1
write_iops_log=be1
filename=/data1/be1.bin
cgroup=be1
runtime=60s
[be2]
name=be2
write_bw_log=be2
write_lat_log=be2
write_iops_log=be2
filename=/data1/be2.bin
cgroup=be2
runtime=60s
[idle]
name=idle
write_bw_log=idle
write_lat_log=idle
write_iops_log=idle
filename=/data1/idle.bin
cgroup=idle
runtime=90s

V3:
1. Introduce prio_expire for bfqq.
2. Introduce better_fairness mode.
3. Optimize the processing of task ioprio and group ioprio.
4. Optimize some minor points.

V2:
1. Optimize bfq_select_next_class().
2. Introduce bfq_group[] to track the number of groups for each class.
3. Optimize IO injection, EMQ and the idle mechanism for CLASS_RT.

Chunguang Xu (14):
bfq: introduce bfq_entity_to_bfqg helper method
bfq: convert the type of bfq_group.bfqd to bfq_data*
bfq: introduce bfq.ioprio for cgroup
bfq: introduce bfq_ioprio_class to get ioprio class
bfq: limit the IO depth of CLASS_IDLE to 1
bfq: keep the minimum bandwidth for CLASS_BE
bfq: introduce better_fairness for container scene
bfq: introduce prio_expire flag for bfq_queue
bfq: expire in_serv_queue for prio_expire under better_fairness
bfq: optimize IO injection under better_fairness
bfq: disable idle for prio_expire under better_fairness
bfq: disable merging between different groups under better_fairness
bfq: remove unnecessary initialization logic
bfq: optimize the calculation of bfq_weight_to_ioprio()

block/bfq-cgroup.c | 99 ++++++++++++++++++++++++++---
block/bfq-iosched.c | 119 +++++++++++++++++++++++++++++++---
block/bfq-iosched.h | 36 +++++++++--
block/bfq-wf2q.c | 180 ++++++++++++++++++++++++++++++++++++++++++----------
4 files changed, 376 insertions(+), 58 deletions(-)

--
1.8.3.1


2021-03-25 06:59:44

by brookxu.cn

[permalink] [raw]
Subject: [PATCH v3 02/14] bfq: convert the type of bfq_group.bfqd to bfq_data*

From: Chunguang Xu <[email protected]>

Declaring bfq_group.bfqd as void * does not make much sense, as it
causes unnecessary type conversions. Change it to struct bfq_data *.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-cgroup.c | 2 +-
block/bfq-iosched.h | 2 +-
block/bfq-wf2q.c | 6 +++---
3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index a5f544a..50d06c7 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -224,7 +224,7 @@ void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq,
{
blkg_rwstat_add(&bfqg->stats.queued, op, 1);
bfqg_stats_end_empty_time(&bfqg->stats);
- if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue))
+ if (!(bfqq == bfqg->bfqd->in_service_queue))
bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq));
}

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index a6f98e9..28d8590 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -914,7 +914,7 @@ struct bfq_group {
struct bfq_entity entity;
struct bfq_sched_data sched_data;

- void *bfqd;
+ struct bfq_data *bfqd;

struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
struct bfq_queue *async_idle_bfqq;
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 5ff0028..276f225 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -498,7 +498,7 @@ static void bfq_active_insert(struct bfq_service_tree *st,
#ifdef CONFIG_BFQ_GROUP_IOSCHED
sd = entity->sched_data;
bfqg = container_of(sd, struct bfq_group, sched_data);
- bfqd = (struct bfq_data *)bfqg->bfqd;
+ bfqd = bfqg->bfqd;
#endif
if (bfqq)
list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
@@ -597,7 +597,7 @@ static void bfq_active_extract(struct bfq_service_tree *st,
#ifdef CONFIG_BFQ_GROUP_IOSCHED
sd = entity->sched_data;
bfqg = container_of(sd, struct bfq_group, sched_data);
- bfqd = (struct bfq_data *)bfqg->bfqd;
+ bfqd = bfqg->bfqd;
#endif
if (bfqq)
list_del(&bfqq->bfqq_list);
@@ -743,7 +743,7 @@ struct bfq_service_tree *
else {
sd = entity->my_sched_data;
bfqg = container_of(sd, struct bfq_group, sched_data);
- bfqd = (struct bfq_data *)bfqg->bfqd;
+ bfqd = bfqg->bfqd;
}
#endif

--
1.8.3.1

2021-03-25 07:00:06

by brookxu.cn

Subject: [PATCH v3 04/14] bfq: introduce bfq_ioprio_class to get ioprio class

From: Chunguang Xu <[email protected]>

Since the tasks inside a container may themselves have different
ioprios, in order to be compatible with the actual production
environment, we use the task's ioprio class when scheduling within
a group, but the group's ioprio class outside the group. For
example, when counting busy_queues, tasks from a CLASS_IDLE group,
regardless of their own ioprio, are always treated as CLASS_IDLE
tasks.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 29 ++++++++++++++++++++++++++---
block/bfq-iosched.h | 1 +
block/bfq-wf2q.c | 4 ++--
3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ec482e6..5f7a0cc 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -428,7 +428,30 @@ void bfq_schedule_dispatch(struct bfq_data *bfqd)
}
}

-#define bfq_class_idle(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
+unsigned short bfq_ioprio_class(struct bfq_entity *entity)
+{
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ unsigned short class = BFQ_DEFAULT_GRP_CLASS;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ struct bfq_group *bfqg;
+
+ if (bfqq) {
+ bfqg = bfqq_group(bfqq);
+ class = bfqg->ioprio_class ?: bfqq->ioprio_class;
+ } else {
+ bfqg = bfq_entity_to_bfqg(entity);
+ class = bfqg->ioprio_class ?: BFQ_DEFAULT_GRP_CLASS;
+ }
+#else
+ if (bfqq)
+ class = bfqq->ioprio_class;
+#endif
+ return class;
+}
+
+#define bfq_class(bfq) (bfq_ioprio_class(&bfq->entity))
+#define bfq_class_rt(bfq) (bfq_ioprio_class(&bfq->entity) == IOPRIO_CLASS_RT)
+#define bfq_class_idle(bfq) (bfq_ioprio_class(&bfq->entity) == IOPRIO_CLASS_IDLE)

#define bfq_sample_valid(samples) ((samples) > 80)

@@ -1635,7 +1658,7 @@ static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq,
{
int bfqq_weight, in_serv_weight;

- if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class)
+ if (bfq_class(bfqq) < bfq_class(in_serv_bfqq))
return true;

if (in_serv_bfqq->entity.parent == bfqq->entity.parent) {
@@ -2600,7 +2623,7 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
return false;

if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) ||
- (bfqq->ioprio_class != new_bfqq->ioprio_class))
+ (bfq_class(bfqq) != bfq_class(new_bfqq)))
return false;

/*
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 3416a75..29a56b8 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -1071,6 +1071,7 @@ void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bool expiration);
void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+unsigned short bfq_ioprio_class(struct bfq_entity *entity);

/* --------------- end of interface of B-WF2Q+ ---------------- */

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 7405be9..c91109e 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1702,7 +1702,7 @@ void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,

bfq_clear_bfqq_busy(bfqq);

- bfqd->busy_queues[bfqq->ioprio_class - 1]--;
+ bfqd->busy_queues[bfq_ioprio_class(&bfqq->entity) - 1]--;

if (bfqq->wr_coeff > 1)
bfqd->wr_busy_queues--;
@@ -1725,7 +1725,7 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
bfq_activate_bfqq(bfqd, bfqq);

bfq_mark_bfqq_busy(bfqq);
- bfqd->busy_queues[bfqq->ioprio_class - 1]++;
+ bfqd->busy_queues[bfq_ioprio_class(&bfqq->entity) - 1]++;

if (!bfqq->dispatched)
if (bfqq->wr_coeff == 1)
--
1.8.3.1

2021-03-25 07:00:06

by brookxu.cn

Subject: [PATCH v3 05/14] bfq: limit the IO depth of CLASS_IDLE to 1

From: Chunguang Xu <[email protected]>

The IO depth of queues belonging to CLASS_IDLE is limited to 1,
so as to avoid introducing a large tail latency on devices with
a large IO depth. Although limiting the IO depth may reduce the
performance of the idle class, this is generally not a big
problem, because the idle class usually does not have strict
performance requirements.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5f7a0cc..8eaf0eb 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4831,6 +4831,17 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
if (!bfqq)
goto exit;

+ /*
+ * Here, the IO depth of queues belonging to CLASS_IDLE is
+ * limited to 1, so as to avoid introducing a large tail latency
+ * on devices with a large IO depth. Although limiting the IO
+ * depth may reduce the performance of the idle class, this is
+ * generally not a big problem, because the idle class usually
+ * does not have strict performance requirements.
+ */
+ if (bfq_class_idle(bfqq) && bfqq->dispatched)
+ goto exit;
+
rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);

if (rq) {
--
1.8.3.1

2021-03-25 07:00:06

by brookxu.cn

Subject: [PATCH v3 06/14] bfq: keep the minimum bandwidth for CLASS_BE

From: Chunguang Xu <[email protected]>

CLASS_RT preempts other classes, which may starve them. At
present, the starvation of CLASS_IDLE is alleviated through
the minimum-bandwidth mechanism. Similarly, we should do the
same for CLASS_BE.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 6 ++++--
block/bfq-iosched.h | 11 ++++++----
block/bfq-wf2q.c | 59 ++++++++++++++++++++++++++++++++++++++---------------
3 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 8eaf0eb..ee8c457 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6560,9 +6560,11 @@ static void bfq_init_root_group(struct bfq_group *root_group,
root_group->bfqd = bfqd;
#endif
root_group->rq_pos_tree = RB_ROOT;
- for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
- root_group->sched_data.bfq_class_idle_last_service = jiffies;
+ root_group->sched_data.bfq_class_last_service[i] = jiffies;
+ }
+ root_group->sched_data.class_timeout_last_check = jiffies;
}

static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 29a56b8..f9ed1da 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -13,7 +13,7 @@
#include "blk-cgroup-rwstat.h"

#define BFQ_IOPRIO_CLASSES 3
-#define BFQ_CL_IDLE_TIMEOUT (HZ/5)
+#define BFQ_CLASS_TIMEOUT (HZ/5)

#define BFQ_MIN_WEIGHT 1
#define BFQ_MAX_WEIGHT 1000
@@ -97,9 +97,12 @@ struct bfq_sched_data {
struct bfq_entity *next_in_service;
/* array of service trees, one per ioprio_class */
struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
- /* last time CLASS_IDLE was served */
- unsigned long bfq_class_idle_last_service;
-
+ /* last time the class was served */
+ unsigned long bfq_class_last_service[BFQ_IOPRIO_CLASSES];
+ /* last time class timeout was checked */
+ unsigned long class_timeout_last_check;
+ /* next index to check class timeout */
+ unsigned int next_class_index;
};

/**
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index c91109e..1f8f3c5 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1188,6 +1188,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
{
struct bfq_sched_data *sd = entity->sched_data;
struct bfq_service_tree *st;
+ int idx = bfq_class_idx(entity);
bool is_in_service;

if (!entity->on_st_or_in_serv) /*
@@ -1227,6 +1228,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
else
bfq_idle_insert(st, entity);

+ sd->bfq_class_last_service[idx] = jiffies;
return true;
}

@@ -1455,6 +1457,45 @@ static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st,
return entity;
}

+static int bfq_select_next_class(struct bfq_sched_data *sd)
+{
+ struct bfq_service_tree *st = sd->service_tree;
+ unsigned long last_check, last_serve;
+ int i, class_idx, next_class = 0;
+ bool found = false;
+
+ /*
+ * We need to guarantee a minimum bandwidth for each class (if
+ * there is some active entity in that class). This should also
+ * mitigate priority-inversion problems in case a low-priority
+ * task is holding file system resources.
+ */
+ last_check = sd->class_timeout_last_check;
+ if (time_is_after_jiffies(last_check + BFQ_CLASS_TIMEOUT))
+ return next_class;
+
+ sd->class_timeout_last_check = jiffies;
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+ class_idx = (sd->next_class_index + i) % BFQ_IOPRIO_CLASSES;
+ last_serve = sd->bfq_class_last_service[class_idx];
+
+ if (time_is_after_jiffies(last_serve + BFQ_CLASS_TIMEOUT))
+ continue;
+
+ if (!RB_EMPTY_ROOT(&(st + class_idx)->active)) {
+ if (found)
+ continue;
+
+ next_class = class_idx++;
+ class_idx %= BFQ_IOPRIO_CLASSES;
+ sd->next_class_index = class_idx;
+ found = true;
+ }
+ sd->bfq_class_last_service[class_idx] = jiffies;
+ }
+ return next_class;
+}
+
/**
* bfq_lookup_next_entity - return the first eligible entity in @sd.
* @sd: the sched_data.
@@ -1468,24 +1509,8 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
bool expiration)
{
struct bfq_service_tree *st = sd->service_tree;
- struct bfq_service_tree *idle_class_st = st + (BFQ_IOPRIO_CLASSES - 1);
struct bfq_entity *entity = NULL;
- int class_idx = 0;
-
- /*
- * Choose from idle class, if needed to guarantee a minimum
- * bandwidth to this class (and if there is some active entity
- * in idle class). This should also mitigate
- * priority-inversion problems in case a low priority task is
- * holding file system resources.
- */
- if (time_is_before_jiffies(sd->bfq_class_idle_last_service +
- BFQ_CL_IDLE_TIMEOUT)) {
- if (!RB_EMPTY_ROOT(&idle_class_st->active))
- class_idx = BFQ_IOPRIO_CLASSES - 1;
- /* About to be served if backlogged, or not yet backlogged */
- sd->bfq_class_idle_last_service = jiffies;
- }
+ int class_idx = bfq_select_next_class(sd);

/*
* Find the next entity to serve for the highest-priority
--
1.8.3.1

2021-03-25 07:00:10

by brookxu.cn

Subject: [PATCH v3 07/14] bfq: introduce better_fairness for container scene

From: Chunguang Xu <[email protected]>

In the container scenario, in addition to throughput, we also
pay attention to QoS. In order to better support this scenario,
we introduce the better_fairness mode here. In this mode, we
expect to better control the QoS of each group according to its
priority. This patch only adds the configuration interface.
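
With this patch applied, the switch is expected to appear alongside
the other iosched attributes and can be toggled like any of them,
e.g.:

echo 1 > /sys/block/sdb/queue/iosched/better_fairness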

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 22 ++++++++++++++++++++++
block/bfq-iosched.h | 10 ++++++++++
2 files changed, 32 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ee8c457..e7bc5e2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6745,6 +6745,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page) \
SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);
SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);
+SHOW_FUNCTION(bfq_better_fairness_show, bfqd->better_fairness, 0);
SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
#undef SHOW_FUNCTION

@@ -6886,6 +6887,26 @@ static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
return count;
}

+static ssize_t bfq_better_fairness_store(struct elevator_queue *e,
+ const char *page, size_t count)
+{
+ struct bfq_data *bfqd = e->elevator_data;
+ unsigned long __data;
+ int ret;
+
+ ret = bfq_var_store(&__data, (page));
+ if (ret)
+ return ret;
+
+ if (__data > 1)
+ __data = 1;
+ if (__data == 0 && bfqd->better_fairness != 0)
+ bfq_end_wr(bfqd);
+ bfqd->better_fairness = __data;
+
+ return count;
+}
+
static ssize_t bfq_low_latency_store(struct elevator_queue *e,
const char *page, size_t count)
{
@@ -6919,6 +6940,7 @@ static ssize_t bfq_low_latency_store(struct elevator_queue *e,
BFQ_ATTR(max_budget),
BFQ_ATTR(timeout_sync),
BFQ_ATTR(strict_guarantees),
+ BFQ_ATTR(better_fairness),
BFQ_ATTR(low_latency),
__ATTR_NULL
};
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index f9ed1da..674de8b 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -672,6 +672,16 @@ struct bfq_data {
bool strict_guarantees;

/*
+ * If there is no priority preemption, we force the device to
+ * idle to guarantee QoS. IO injection also has some additional
+ * restrictions: the injected/merged queue must come from the
+ * same class in the same group. Doing so reduces the
+ * throughput of the system, but it better guarantees the
+ * QoS of each group and of real-time tasks.
+ */
+ bool better_fairness;
+
+ /*
* Last time at which a queue entered the current burst of
* queues being activated shortly after each other; for more
* details about this and the following parameters related to
--
1.8.3.1

2021-03-25 07:00:44

by brookxu.cn

Subject: [PATCH v3 10/14] bfq: optimize IO injection under better_fairness

From: Chunguang Xu <[email protected]>

In order to ensure better QoS for tasks of different groups and
different classes under better_fairness, we only allow queues of
the same class in the same group to inject into each other.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 51192bd..be5b1e3 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1900,6 +1900,27 @@ static void bfq_reset_inject_limit(struct bfq_data *bfqd,
bfqq->decrease_time_jif = jiffies;
}

+static bool bfq_bfqq_may_inject(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+ struct bfq_data *bfqd = bfqq->bfqd;
+ bool ret = true;
+
+ if (unlikely(bfqd->better_fairness)) {
+ /*
+ * In addition to throughput, better_fairness also pays
+ * attention to QoS. In the container scenario, in order
+ * to ensure the QoS of each group, we only allow tasks
+ * of the same class in the same group to be injected.
+ */
+ if (bfq_class(bfqq) != bfq_class(new_bfqq))
+ ret = false;
+
+ if (bfqq_group(bfqq) != bfqq_group(new_bfqq))
+ ret = false;
+ }
+ return ret;
+}
+
static void bfq_update_io_intensity(struct bfq_queue *bfqq, u64 now_ns)
{
u64 tot_io_time = now_ns - bfqq->io_start_time;
@@ -1985,7 +2006,8 @@ static void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfqd->last_completed_rq_bfqq == bfqq ||
bfq_bfqq_has_short_ttime(bfqq) ||
now_ns - bfqd->last_completion >= 4 * NSEC_PER_MSEC ||
- bfqd->last_completed_rq_bfqq == bfqq->waker_bfqq)
+ bfqd->last_completed_rq_bfqq == bfqq->waker_bfqq ||
+ !bfq_bfqq_may_inject(bfqq, bfqd->last_completed_rq_bfqq))
return;

if (bfqd->last_completed_rq_bfqq !=
@@ -4415,6 +4437,9 @@ static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
else
limit = in_serv_bfqq->inject_limit;

+ if (!bfq_bfqq_may_inject(in_serv_bfqq, bfqq))
+ continue;
+
if (bfqd->rq_in_driver < limit) {
bfqd->rqs_injected = true;
return bfqq;
@@ -4590,6 +4615,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
* happen to be served only after other queues.
*/
if (async_bfqq &&
+ !(bfqd->better_fairness && !bfq_class_idx(&bfqq->entity)) &&
icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
bfq_bfqq_budget_left(async_bfqq))
--
1.8.3.1

2021-03-25 07:00:54

by brookxu.cn

Subject: [PATCH v3 13/14] bfq: remove unnecessary initialization logic

From: Chunguang Xu <[email protected]>

Since sched_data.service_tree[] is initialized in
bfq_init_root_group(), bfq_create_group_hierarchy() can skip
this part of the initialization, which avoids initializing it
twice.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-cgroup.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index ab4bc41..05054e1 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -1514,15 +1514,11 @@ void bfqg_and_blkg_put(struct bfq_group *bfqg) {}
struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
{
struct bfq_group *bfqg;
- int i;

bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
if (!bfqg)
return NULL;

- for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
- bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
-
return bfqg;
}
#endif /* CONFIG_BFQ_GROUP_IOSCHED */
--
1.8.3.1

2021-03-25 07:00:54

by brookxu.cn

Subject: [PATCH v3 11/14] bfq: disable idle for prio_expire under better_fairness

From: Chunguang Xu <[email protected]>

Under better_fairness, if a higher-priority queue is waiting
for service, disable queue idling so that a new schedule can be
invoked in time. Queues other than CLASS_IDLE still allow
idling, so that buffered IO can be better controlled as well.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index be5b1e3..5aa9c2c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4307,6 +4307,14 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
return true;

/*
+ * In better_fairness mode, we also put emphasis on QoS. The main
+ * purpose of allowing idling here is to ensure better isolation
+ * of buffered IO.
+ */
+ if (unlikely(bfqd->better_fairness))
+ return !(bfqd->bfq_slice_idle == 0 || bfq_class_idle(bfqq));
+
+ /*
* Idling is performed only if slice_idle > 0. In addition, we
* do not idle if
* (a) bfqq is async
@@ -4318,6 +4326,9 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
bfq_class_idle(bfqq))
return false;

+ if (bfq_may_expire_in_serv_for_prio(&bfqq->entity))
+ return false;
+
idling_boosts_thr_with_no_issue =
idling_boosts_thr_without_issues(bfqd, bfqq);

--
1.8.3.1

2021-03-25 07:01:03

by brookxu.cn

Subject: [PATCH v3 14/14] bfq: optimize the calculation of bfq_weight_to_ioprio()

From: Chunguang Xu <[email protected]>

The value range of an ioprio is [0, 7], but the result of
bfq_weight_to_ioprio() may exceed this range, so a simple clamp
is required. For example, with IOPRIO_BE_NR = 8 and
BFQ_WEIGHT_CONVERSION_COEFF = 10, the old formula
max(0, 8 * 10 - weight) yields 79 for weight 1.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-wf2q.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index b477a9b..1b91c8b 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -586,8 +586,9 @@ unsigned short bfq_ioprio_to_weight(int ioprio)
*/
static unsigned short bfq_weight_to_ioprio(int weight)
{
- return max_t(int, 0,
- IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight);
+ int ioprio = IOPRIO_BE_NR - weight / BFQ_WEIGHT_CONVERSION_COEFF;
+
+ return ioprio < 0 ? 0 : min_t(int, ioprio, IOPRIO_BE_NR - 1);
}

static void bfq_get_entity(struct bfq_entity *entity)
--
1.8.3.1

2021-03-25 07:01:14

by brookxu.cn

Subject: [PATCH v3 08/14] bfq: introduce prio_expire flag for bfq_queue

From: Chunguang Xu <[email protected]>

When in_service_queue needs to be preempted by a task with a
higher priority, we mark it with the prio_expire flag and then
expire it on the IO dispatch path. This patch only adds the
prio_expire flag.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 2 ++
block/bfq-iosched.h | 2 ++
2 files changed, 4 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index e7bc5e2..6e19b5a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -153,6 +153,7 @@
BFQ_BFQQ_FNS(wait_request);
BFQ_BFQQ_FNS(non_blocking_wait_rq);
BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(prio_expire);
BFQ_BFQQ_FNS(has_short_ttime);
BFQ_BFQQ_FNS(sync);
BFQ_BFQQ_FNS(IO_bound);
@@ -2986,6 +2987,7 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
{
if (bfqq) {
bfq_clear_bfqq_fifo_expire(bfqq);
+ bfq_clear_bfqq_prio_expire(bfqq);

bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 674de8b..8af5ac0 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -777,6 +777,7 @@ enum bfqq_state_flags {
* without idling the device
*/
BFQQF_fifo_expire, /* FIFO checked in this slice */
+ BFQQF_prio_expire, /* should expire for a higher prio queue */
BFQQF_has_short_ttime, /* queue has a short think time */
BFQQF_sync, /* synchronous queue */
BFQQF_IO_bound, /*
@@ -806,6 +807,7 @@ enum bfqq_state_flags {
BFQ_BFQQ_FNS(wait_request);
BFQ_BFQQ_FNS(non_blocking_wait_rq);
BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(prio_expire);
BFQ_BFQQ_FNS(has_short_ttime);
BFQ_BFQQ_FNS(sync);
BFQ_BFQQ_FNS(IO_bound);
--
1.8.3.1

2021-03-25 07:01:43

by brookxu.cn

Subject: [PATCH v3 03/14] bfq: introduce bfq.ioprio for cgroup

From: Chunguang Xu <[email protected]>

Now the ioprio class of all groups is CLASS_BE, which is not very
friendly to the container scene. Therefore, we introduce the
bfq.ioprio interface to allow users to configure the ioprio class
and ioprio of a group, which can meet more priority requirements.

The bfq.ioprio interface is now available for cgroup v1 and cgroup
v2. Users can configure the ioprio of a cgroup through this
interface, as shown below:

echo "1 2" > blkio.bfq.ioprio

The two values represent the ioprio class and the ioprio of the
cgroup, respectively. When necessary, the feature can be disabled
by:

echo "0 0" > bfq.ioprio

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-cgroup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
block/bfq-iosched.h | 8 +++++
block/bfq-wf2q.c | 30 +++++++++++++++---
3 files changed, 119 insertions(+), 6 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 50d06c7..ab4bc41 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -489,7 +489,7 @@ static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd)
return cpd ? container_of(cpd, struct bfq_group_data, pd) : NULL;
}

-static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
+struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
{
return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
}
@@ -553,6 +553,16 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
bfqg->bfqd = bfqd;
bfqg->active_entities = 0;
bfqg->rq_pos_tree = RB_ROOT;
+
+ bfqg->new_ioprio_class = IOPRIO_PRIO_CLASS(d->ioprio);
+ bfqg->new_ioprio = IOPRIO_PRIO_DATA(d->ioprio);
+ bfqg->ioprio_class = bfqg->new_ioprio_class;
+ bfqg->ioprio = bfqg->new_ioprio;
+
+ if (d->ioprio) {
+ entity->new_weight = bfq_ioprio_to_weight(bfqg->ioprio);
+ entity->weight = entity->new_weight;
+ }
}

static void bfq_pd_free(struct blkg_policy_data *pd)
@@ -981,6 +991,20 @@ static int bfq_io_show_weight(struct seq_file *sf, void *v)
return 0;
}

+static int bfq_io_show_ioprio(struct seq_file *sf, void *v)
+{
+ struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+ struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+ unsigned int val = 0;
+
+ if (bfqgd)
+ val = bfqgd->ioprio;
+
+ seq_printf(sf, "%u %u\n", IOPRIO_PRIO_CLASS(val), IOPRIO_PRIO_DATA(val));
+
+ return 0;
+}
+
static void bfq_group_set_weight(struct bfq_group *bfqg, u64 weight, u64 dev_weight)
{
weight = dev_weight ?: weight;
@@ -1098,6 +1122,55 @@ static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
return bfq_io_set_device_weight(of, buf, nbytes, off);
}

+static ssize_t bfq_io_set_ioprio(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct cgroup_subsys_state *css = of_css(of);
+ struct blkcg *blkcg = css_to_blkcg(css);
+ struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+ struct blkcg_gq *blkg;
+ unsigned int class, data;
+ char *endp;
+
+ buf = strstrip(buf);
+
+ class = simple_strtoul(buf, &endp, 10);
+ if (*endp != ' ')
+ return -EINVAL;
+ buf = endp + 1;
+
+ data = simple_strtoul(buf, &endp, 10);
+ if ((*endp != ' ') && (*endp != '\0'))
+ return -EINVAL;
+
+ if (class > IOPRIO_CLASS_IDLE || data >= IOPRIO_BE_NR)
+ return -EINVAL;
+
+ spin_lock_irq(&blkcg->lock);
+ bfqgd->ioprio = IOPRIO_PRIO_VALUE(class, data);
+ hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
+ struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+
+ if (bfqg) {
+ if ((bfqg->ioprio_class != class) ||
+ (bfqg->ioprio != data)) {
+ unsigned short weight;
+
+ weight = class ? bfq_ioprio_to_weight(data) :
+ BFQ_WEIGHT_LEGACY_DFL;
+
+ bfqg->new_ioprio_class = class;
+ bfqg->new_ioprio = data;
+ bfqg->entity.new_weight = weight;
+ bfqg->entity.prio_changed = 1;
+ }
+ }
+ }
+ spin_unlock_irq(&blkcg->lock);
+
+ return nbytes;
+}
+
static int bfqg_print_rwstat(struct seq_file *sf, void *v)
{
blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_rwstat,
@@ -1264,6 +1337,12 @@ struct cftype bfq_blkcg_legacy_files[] = {
.seq_show = bfq_io_show_weight,
.write = bfq_io_set_weight,
},
+ {
+ .name = "bfq.ioprio",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = bfq_io_show_ioprio,
+ .write = bfq_io_set_ioprio,
+ },

/* statistics, covers only the tasks in the bfqg */
{
@@ -1384,6 +1463,12 @@ struct cftype bfq_blkg_files[] = {
.seq_show = bfq_io_show_weight,
.write = bfq_io_set_weight,
},
+ {
+ .name = "bfq.ioprio",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = bfq_io_show_ioprio,
+ .write = bfq_io_set_ioprio,
+ },
{} /* terminate */
};

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 28d8590..3416a75 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -867,6 +867,7 @@ struct bfq_group_data {
struct blkcg_policy_data pd;

unsigned int weight;
+ unsigned short ioprio;
};

/**
@@ -923,6 +924,11 @@ struct bfq_group {

int active_entities;

+ /* current ioprio and ioprio class */
+ unsigned short ioprio, ioprio_class;
+ /* next ioprio and ioprio class if a change is in progress */
+ unsigned short new_ioprio, new_ioprio_class;
+
struct rb_root rq_pos_tree;

struct bfqg_stats stats;
@@ -991,6 +997,7 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg);
void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio);
void bfq_end_wr_async(struct bfq_data *bfqd);
+struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg);
struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
struct blkcg *blkcg);
struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
@@ -1037,6 +1044,7 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,

struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq);
struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+unsigned int bfq_class_idx(struct bfq_entity *entity);
unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd);
struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity);
struct bfq_entity *bfq_entity_of(struct rb_node *node);
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 276f225..7405be9 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -27,12 +27,21 @@ static struct bfq_entity *bfq_root_active_entity(struct rb_root *tree)
return rb_entry(node, struct bfq_entity, rb_node);
}

-static unsigned int bfq_class_idx(struct bfq_entity *entity)
+unsigned int bfq_class_idx(struct bfq_entity *entity)
{
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+ unsigned short class = BFQ_DEFAULT_GRP_CLASS;

- return bfqq ? bfqq->ioprio_class - 1 :
- BFQ_DEFAULT_GRP_CLASS - 1;
+ if (bfqq)
+ class = bfqq->ioprio_class;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ else {
+ struct bfq_group *bfqg = bfq_entity_to_bfqg(entity);
+
+ class = bfqg->ioprio_class ?: BFQ_DEFAULT_GRP_CLASS;
+ }
+#endif
+ return class - 1;
}

unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd)
@@ -767,14 +776,25 @@ struct bfq_service_tree *
bfq_weight_to_ioprio(entity->orig_weight);
}

- if (bfqq && update_class_too)
- bfqq->ioprio_class = bfqq->new_ioprio_class;
+ if (update_class_too) {
+ if (bfqq)
+ bfqq->ioprio_class = bfqq->new_ioprio_class;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ else
+ bfqg->ioprio_class = bfqg->new_ioprio_class;
+#endif
+ }

/*
* Reset prio_changed only if the ioprio_class change
* is not pending any longer.
*/
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ if ((bfqq && bfqq->ioprio_class == bfqq->new_ioprio_class) ||
+ (!bfqq && bfqg->ioprio_class == bfqg->new_ioprio_class))
+#else
if (!bfqq || bfqq->ioprio_class == bfqq->new_ioprio_class)
+#endif
entity->prio_changed = 0;

/*
--
1.8.3.1

2021-03-25 07:02:26

by brookxu.cn

Subject: [PATCH v3 09/14] bfq: expire in_serv_queue for prio_expire under better_fairness

From: Chunguang Xu <[email protected]>

Traverse all scheduling domains upward; if there are
higher-priority tasks waiting for service, mark in_service_queue
with prio_expire and then expire it, so that RT tasks can be
scheduled in time.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 7 +++----
block/bfq-iosched.h | 1 +
block/bfq-wf2q.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6e19b5a..51192bd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4736,10 +4736,9 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
* belongs to CLASS_IDLE and other queues are waiting for
* service.
*/
- if (!(bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)))
- goto return_rq;
-
- bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
+ if ((bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)) ||
+ bfq_bfqq_prio_expire(bfqq))
+ bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);

return_rq:
return rq;
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 8af5ac0..1406398 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -989,6 +989,7 @@ void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
void bfq_release_process_ref(struct bfq_data *bfqd, struct bfq_queue *bfqq);
void bfq_schedule_dispatch(struct bfq_data *bfqd);
void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
+bool bfq_may_expire_in_serv_for_prio(struct bfq_entity *entity);

/* ------------ end of main algorithm interface -------------- */

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 1f8f3c5..b477a9b 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -161,6 +161,51 @@ struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq)
return bfq_entity_to_bfqg(group_entity);
}

+bool bfq_may_expire_in_serv_for_prio(struct bfq_entity *entity)
+{
+ struct bfq_sched_data *sd;
+ struct bfq_queue *bfqq;
+ struct bfq_group *bfqg;
+ bool ret = false;
+
+ sd = entity->sched_data;
+ bfqg = container_of(sd, struct bfq_group, sched_data);
+
+ if (likely(!bfqg->bfqd->better_fairness))
+ return false;
+
+ bfqq = bfqg->bfqd->in_service_queue;
+ if (bfqq) {
+ struct bfq_entity *next_in_serv;
+
+ /*
+ * Traverse the upper-level scheduling domains for
+ * prio preemption, and expire in_service_queue
+ * if necessary.
+ */
+ entity = &bfqq->entity;
+ for_each_entity(entity) {
+ sd = entity->sched_data;
+ next_in_serv = sd->next_in_service;
+
+ if (!next_in_serv)
+ continue;
+
+ /*
+ * Expire bfqq, if next_in_serv belongs to
+ * a higher class.
+ */
+ if (bfq_class_idx(next_in_serv) <
+ bfq_class_idx(entity)) {
+ bfq_mark_bfqq_prio_expire(bfqq);
+ ret = true;
+ break;
+ }
+ }
+ }
+ return ret;
+}
+
/*
* Returns true if this budget changes may let next_in_service->parent
* become the next_in_service entity for its parent entity.
@@ -244,6 +289,11 @@ struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq)
return bfqq->bfqd->root_group;
}

+bool bfq_may_expire_in_serv_for_prio(struct bfq_entity *entity)
+{
+ return false;
+}
+
static bool bfq_update_parent_budget(struct bfq_entity *next_in_service)
{
return false;
@@ -1162,6 +1212,7 @@ static void bfq_activate_requeue_entity(struct bfq_entity *entity,
bool non_blocking_wait_rq,
bool requeue, bool expiration)
{
+ struct bfq_entity *old_entity = entity;
struct bfq_sched_data *sd;

for_each_entity(entity) {
@@ -1172,6 +1223,15 @@ static void bfq_activate_requeue_entity(struct bfq_entity *entity,
!requeue)
break;
}
+
+ /*
+ * Expire in_service_queue: if a task belonging to a higher
+ * class is added to an upper-level scheduling domain, we
+ * should initiate a new schedule. Here we just mark bfqq
+ * with prio_expire; the real expiration occurs in
+ * bfq_dispatch_rq_from_bfqq().
+ */
+ bfq_may_expire_in_serv_for_prio(old_entity);
}

/**
--
1.8.3.1

2021-03-25 07:02:51

by brookxu.cn

Subject: [PATCH v3 12/14] bfq: disable merging between different groups under better_fairness

From: Chunguang Xu <[email protected]>

In order to better guarantee the QoS of each group, we do not
allow queues of different groups to be merged.

Signed-off-by: Chunguang Xu <[email protected]>
---
block/bfq-iosched.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5aa9c2c..f4a99f9 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2665,6 +2665,9 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq))
return false;

+ if (!bfq_bfqq_may_inject(bfqq, new_bfqq))
+ return false;
+
return true;
}

--
1.8.3.1

2021-04-04 16:11:31

by Tejun Heo

Subject: Re: [PATCH v3 00/14] bfq: introduce bfq.ioprio for cgroup

Hello,

On Thu, Mar 25, 2021 at 02:57:44PM +0800, brookxu wrote:
> INTERFACE:
>
> The bfq.ioprio interface now is available for cgroup v1 and cgroup
> v2. Users can configure the ioprio for cgroup through this
> interface, as shown below:
>
> echo "1 2"> blkio.bfq.ioprio
>
> The above two values respectively represent the values of ioprio
> class and ioprio for cgroup.
>
> EXPERIMENT:
>
> The test process is as follows:
> # prepare data disk
> mount /dev/sdb /data1
>
> # prepare IO scheduler
> echo bfq > /sys/block/sdb/queue/scheduler
> echo 0 > /sys/block/sdb/queue/iosched/low_latency
> echo 1 > /sys/block/sdb/queue/iosched/better_fairness
>
> It is worth noting here that nr_requests limits the number of
> requests, and it does not perceive priority. If nr_requests is
> too small, it may cause a serious priority inversion problem.
> Therefore, we can increase the size of nr_requests based on
> the actual situation.
>
> # create cgroup v1 hierarchy
> cd /sys/fs/cgroup/blkio
> mkdir rt be0 be1 be2 idle
>
> # prepare cgroup
> echo "1 0" > rt/blkio.bfq.ioprio
> echo "2 0" > be0/blkio.bfq.ioprio
> echo "2 4" > be1/blkio.bfq.ioprio
> echo "2 7" > be2/blkio.bfq.ioprio
> echo "3 0" > idle/blkio.bfq.ioprio

Here are some concerns:

* The main benefit of bfq compared to cfq at least was that the behavior
model was defined in a clearer way. It was possible to describe what the
control model was in a way which makes semantic sense. The main problem I
see with this proposal is that it's an interface which grew out of the
current implementation specifics and I'm having a hard time understanding
what the end results should be with different configuration combinations.

* While this might work around some scheduling latency issues but I have a
hard time imagining it being able to address actual QoS issues. e.g. on a
lot of SSDs, without absolute throttling, device side latencies can spike
by multiple orders of magnitude and no prioritization on the scheduler
side is gonna help once such state is reached. Here, there's no robust
mechanisms or measurement/control units defined to address that. In fact,
the above direction to increase nr_requests limit will make priority
inversions on the device and post-elevator side way more likely and
severe.

So, maybe it helps with specific scenarios on some hardware, but given the
ad-hoc nature, I don't think it justifies all the extra interface additions.
My suggestion would be slimming it down to bare essentials and making the
user interface part as minimal as possible.

Thanks.

--
tejun

2021-04-06 16:40:55

by brookxu.cn

Subject: Re: [PATCH v3 00/14] bfq: introduce bfq.ioprio for cgroup



Tejun Heo wrote on 2021/4/5 0:09:
> Hello,

Hi tj, thanks for your reply :)

> On Thu, Mar 25, 2021 at 02:57:44PM +0800, brookxu wrote:
>> INTERFACE:
>>
>> The bfq.ioprio interface now is available for cgroup v1 and cgroup
>> v2. Users can configure the ioprio for cgroup through this
>> interface, as shown below:
>>
>> echo "1 2"> blkio.bfq.ioprio
>>
>> The above two values respectively represent the values of ioprio
>> class and ioprio for cgroup.
>>
>> EXPERIMENT:
>>
>> The test process is as follows:
>> # prepare data disk
>> mount /dev/sdb /data1
>>
>> # prepare IO scheduler
>> echo bfq > /sys/block/sdb/queue/scheduler
>> echo 0 > /sys/block/sdb/queue/iosched/low_latency
>> echo 1 > /sys/block/sdb/queue/iosched/better_fairness
>>
>> It is worth noting here that nr_requests limits the number of
>> requests, and it does not perceive priority. If nr_requests is
>> too small, it may cause a serious priority inversion problem.
>> Therefore, we can increase the size of nr_requests based on
>> the actual situation.
>>
>> # create cgroup v1 hierarchy
>> cd /sys/fs/cgroup/blkio
>> mkdir rt be0 be1 be2 idle
>>
>> # prepare cgroup
>> echo "1 0" > rt/blkio.bfq.ioprio
>> echo "2 0" > be0/blkio.bfq.ioprio
>> echo "2 4" > be1/blkio.bfq.ioprio
>> echo "2 7" > be2/blkio.bfq.ioprio
>> echo "3 0" > idle/blkio.bfq.ioprio
>
> Here are some concerns:
>
> * The main benefit of bfq compared to cfq at least was that the behavior
> model was defined in a clearer way. It was possible to describe what the
> control model was in a way which makes semantic sense. The main problem I
> see with this proposal is that it's an interface which grew out of the
> current implementation specifics and I'm having a hard time understanding
> what the end results should be with different configuration combinations.

In the current scheduling strategy, we consider both the entity's ioprio class
and its budget size. But in fact, there are some differences between bfqq and
bfqg. Since the ioprio class of a bfqg is fixed to BE, the scheduling of a bfqg
actually only considers the budget size. The introduction of ioprio for cgroup
should not destroy or complicate the existing design of bfq. It follows the
original design of bfq and tries to let us reason about the scheduling of
entities more simply, without distinguishing between bfqq and bfqg.

> * While this might work around some scheduling latency issues but I have a
> hard time imagining it being able to address actual QoS issues. e.g. on a
> lot of SSDs, without absolute throttling, device side latencies can spike
> by multiple orders of magnitude and no prioritization on the scheduler
> side is gonna help once such state is reached. Here, there's no robust
> mechanisms or measurement/control units defined to address that. In fact,

The latency caused by SSD firmware operations is unpredictable. Here we try to
control QoS under normal conditions, which covers most scenarios. In the
container scenario, in addition to the overall IO QoS control of the container,
we also hope to achieve more fine-grained QoS control of the tasks inside the
container, such as ioprio support, suppression of async IO, and so on.

> the above direction to increase nr_requests limit will make priority
> inversions on the device and post-elevator side way more likely and
> severe.

Increasing nr_requests is really not a good way. I tried reserving 10% of the
tags for the in-service group by limiting depth, which alleviates this problem
better, but more tests are needed.

> So, maybe it helps with specific scenarios on some hardware, but given the
> ad-hoc nature, I don't think it justifies all the extra interface additions.
> My suggestion would be slimming it down to bare essentials and making the
> user interface part as minimal as possible.

Now the weight of a bfqq is jointly determined by ioprio and weight, and both
ioprio and weight updates modify the entity weight. After the introduction of
bfq.ioprio for cgroup, the processing of a bfqg is the same as that of a bfqq,
and from the perspective of an entity the complexity is not increased. Nothing
new is added on the user side, because per-task ioprio has existed for a long
time.

> Thanks.
>